Manticore Search is a multi-storage database specifically designed for search, with robust full-text search capabilities.
As an open-source database (available on GitHub), Manticore Search was created in 2017 as a continuation of the Sphinx Search engine. Our development team took all the best features of Sphinx and significantly improved its functionality, fixing hundreds of bugs along the way (as detailed in our Changelog). With nearly complete code rewrites, Manticore Search is now a modern, fast, and lightweight database with a full set of features and exceptional full-text search capabilities.
Manticore Search supports adding embeddings generated by your Machine Learning models to each document and then performing a nearest-neighbor search on them. This lets you build features like similarity search, recommendations, semantic search, and relevance ranking based on NLP algorithms, as well as image, video, and sound search.
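For illustration, here is a minimal sketch of how such a vector search can look in SQL (table name, dimensions, and vector values are made up; exact KNN options may vary between Manticore versions):

```sql
-- a table with a 4-dimensional float_vector attribute prepared for KNN search
CREATE TABLE products_vec (
    title text,
    image_vector float_vector knn_type='hnsw' knn_dims='4' hnsw_similarity='l2'
);
INSERT INTO products_vec (id, title, image_vector)
VALUES (1, 'yellow bag', (0.653448, 0.192478, 0.017971, 0.339821));
-- return the documents whose vectors are closest to the query vector
SELECT id, knn_dist() FROM products_vec
WHERE knn (image_vector, 3, (0.286569, -0.031816, 0.066684, 0.032842));
```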
Manticore Search utilizes smart query parallelization to lower response times and fully utilize all CPU cores when needed.
The cost-based query optimizer uses statistical data about the indexed data to evaluate the relative costs of different execution plans for a given query. This allows the optimizer to determine the most efficient plan for retrieving the desired results, taking into account factors such as the size of the indexed data, the complexity of the query, and the available resources.
Manticore offers both row-wise and column-oriented storage options to accommodate datasets of various sizes. The traditional and default row-wise storage option is available for datasets of all sizes - small, medium, and large, while the columnar storage option is provided through the Manticore Columnar Library for even larger datasets. The key difference between these storage options is that row-wise storage requires all attributes (excluding full-text fields) to be kept in RAM for optimal performance, while columnar storage does not, thus offering lower RAM consumption, but with a potential for slightly slower performance (as demonstrated by the statistics on https://db-benchmarks.com/).
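As a rough sketch (assuming the Manticore Columnar Library is installed; table and column names are examples), the storage can be selected per attribute or for the whole table:

```sql
-- keep only the price attribute in columnar storage
CREATE TABLE products_mixed (title text, price float engine='columnar');
-- or make columnar storage the default for all attributes of the table
CREATE TABLE products_columnar (title text, price float) engine='columnar';
```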
Manticore Columnar Library uses Piecewise Geometric Model index, which exploits a learned mapping between the indexed keys and their location in memory. The succinctness of this mapping, coupled with a peculiar recursive construction algorithm, makes the PGM-index a data structure that dominates traditional indexes by orders of magnitude in space while still offering the best query and update time performance. Secondary indexes are ON by default for all numeric fields.
Manticore's native syntax is SQL and it supports SQL over HTTP and MySQL protocol, allowing for connection through popular mysql clients in any programming language.
For a more programmatic approach to managing data and schemas, Manticore provides HTTP JSON protocol, similar to that of Elasticsearch.
You can execute Elasticsearch-compatible insert and replace JSON queries which enables using Manticore with tools like Logstash (version < 7.13), Filebeat and other tools from the Beats family.
Easily create, update, and delete tables online or through a configuration file.
The Manticore Search daemon is developed in C++, offering fast start times and efficient memory utilization. The utilization of low-level optimizations further boosts performance. Another crucial component, called Manticore Buddy, is written in PHP and is utilized for high-level functionality that does not require lightning-fast response times or extremely high processing power. Although contributing to the C++ code may pose a challenge, adding a new SQL/JSON command using Manticore Buddy should be a straightforward process.
Newly added or updated documents can be immediately read.
We offer free interactive courses to make learning effortless.
While Manticore is not fully ACID-compliant, it supports isolated transactions for atomic changes and binary logging for safe writes.
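For example, here is a minimal sketch of an isolated transaction in SQL (assuming a products table like the one created in the Quick start below):

```sql
BEGIN;
INSERT INTO products (title, price) VALUES ('first item', 10.00);
INSERT INTO products (title, price) VALUES ('second item', 20.00);
COMMIT; -- both documents become visible atomically; ROLLBACK would discard them
```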
Data can be distributed across servers and data centers with any Manticore Search node acting as both a load balancer and a data node. Manticore implements virtually synchronous multi-master replication using the Galera library, ensuring data consistency across all nodes, preventing data loss, and providing exceptional replication performance.
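A minimal sketch of setting up replication in SQL (cluster name, table name, and node address are examples):

```sql
-- on the first node: create a cluster and add an existing table to it
CREATE CLUSTER posts;
ALTER CLUSTER posts ADD products;
-- on another node: join the existing cluster
JOIN CLUSTER posts AT '10.12.1.35:9312';
-- writes to a replicated table are prefixed with the cluster name
INSERT INTO posts:products (title, price) VALUES ('replicated item', 5.00);
```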
Manticore is equipped with an external tool manticore-backup, and the BACKUP SQL command to simplify the process of backing up and restoring your data. Alternatively, you can use mysqldump to make logical backups.
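For instance, a sketch of the SQL approach (the destination path is an example; the directory must exist and be writable by the server):

```sql
-- back up everything the server manages
BACKUP TO /tmp/manticore_backup;
-- or back up only selected tables
BACKUP TABLE products TO /tmp/manticore_backup;
```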
The indexer tool and comprehensive configuration syntax of Manticore make it easy to sync data from sources like MySQL, PostgreSQL, ODBC-compatible databases, XML, and CSV.
You can integrate Manticore Search with a MySQL/MariaDB server using the FEDERATED engine or via ProxySQL.
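As an illustration, here is a sketch of the FEDERATED approach on the MySQL/MariaDB side (assuming a Manticore table named products listening on 127.0.0.1:9306; column names are examples). The special query column carries the actual Manticore query:

```sql
CREATE TABLE products_fed (
    id BIGINT UNSIGNED NOT NULL,
    title VARCHAR(255),
    price FLOAT,
    query VARCHAR(1024),
    INDEX (query)
) ENGINE=FEDERATED
DEFAULT CHARSET=utf8
CONNECTION='mysql://FEDERATED@127.0.0.1:9306/DB/products';

SELECT * FROM products_fed
WHERE query='SELECT * FROM products WHERE MATCH(''bag'')';
```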
You can use Apache Superset and Grafana to visualize data stored in Manticore. Various MySQL tools can be used to develop Manticore queries interactively, such as HeidiSQL and DBForge.
Manticore offers a special table type, the "percolate" table, which allows you to search queries instead of data, making it an efficient tool for filtering full-text data streams. Simply store your queries in the table, process your data stream by sending each batch of documents to Manticore Search, and receive only the results that match your stored queries.
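A minimal sketch of that flow in SQL (table name, stored query, and document values are examples):

```sql
-- a percolate table stores queries instead of documents
CREATE TABLE products_alerts (title text, price float) type='pq';
INSERT INTO products_alerts (query, filters) VALUES ('@title bag', 'price<50');
-- feed a batch of documents and get back only the stored queries they match
CALL PQ ('products_alerts', ('{"title": "Crossbody Bag with Tassel", "price": 19.85}'), 1 AS docs, 1 AS query);
```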
Manticore has a variety of use cases, including:
The manual is arranged to reflect the most likely way you would use Manticore:
Key sections of the manual are marked with 1️⃣, 2️⃣, 3️⃣ etc. in the menu for your convenience, since they cover the most commonly used functionality. If you are new to Manticore, we highly recommend not skipping them.
If you are looking for a quick understanding of how Manticore works in general, the ⚡ Quick start guide is a good place to start.
Each query example has a little icon 📋 in the top-right corner:

You can use it to copy examples to the clipboard. If the query is an HTTP request it will be copied as a CURL command. You can configure the host/port if you press ⚙️.
We love search, and we've done our best to make searching in this manual as convenient as possible. Of course, it's backed by Manticore Search. Besides the search bar, which requires opening the manual first, there is a very easy way to find something: just open mnt.cr/your-search-keyword :

There are a few things you need to understand about Manticore Search that can help you follow the best practices of using it.
Manticore Search works in two modes:
- RT (real-time) mode: tables are managed online with CREATE/ALTER/DROP TABLE and their equivalents in non-SQL clients.
- Plain mode: tables are defined in the configuration file.

You cannot combine the two modes and need to decide which one to follow by specifying data_dir in your configuration file (which is the default behaviour). If you are unsure, our recommendation is to follow the RT mode, since even if you need a plain table, you can build it with a separate plain table config and import it into your main Manticore instance.
Real-time tables can be used in both RT and plain modes. In the RT mode a real-time table is defined with a CREATE TABLE command, while in the plain mode it is defined in the configuration file. Plain (offline) tables are supported only in the plain mode. Plain tables cannot be created in the RT mode, but existing plain tables made in the plain mode can be converted to real-time tables and imported in the RT mode.
Manticore provides multiple ways and interfaces to manage your schemas and data, but the two main ones are:
sudo yum install https://repo.manticoresearch.com/manticore-repo.noarch.rpm
sudo yum install manticore manticore-extra
sudo yum --setopt=tsflags=noscripts remove manticore*
wget https://repo.manticoresearch.com/manticore-repo.noarch.deb
sudo dpkg -i manticore-repo.noarch.deb
sudo apt update
sudo apt install manticore manticore-extra
sudo apt remove manticore*
brew install manticoresoftware/tap/manticoresearch manticoresoftware/tap/manticore-extra
docker run -e EXTRA=1 --name manticore --rm -d manticoresearch/manticore && echo "Waiting for Manticore docker to start. Consider mapping the data_dir to make it start faster next time" && until docker logs manticore 2>&1 | grep -q "accepting connections"; do sleep 1; echo -n .; done && echo && docker exec -it manticore mysql && docker stop manticore
docker run -e EXTRA=1 --name manticore -v $(pwd)/data:/var/lib/manticore -p 127.0.0.1:9306:9306 -p 127.0.0.1:9308:9308 -d manticoresearch/manticore
Docker images of Manticore Search are publicly accessible on Docker Hub, built from the Manticore Search docker GitHub repository.
To retrieve the Manticore image, run the following command:
docker pull manticoresearch/manticore
For more information about using Manticore in Docker, see the Using Manticore in Docker section.
The simplest method to install Manticore on RedHat/CentOS is by using our YUM repository:
Install the repository:
sudo yum install https://repo.manticoresearch.com/manticore-repo.noarch.rpm
Then install Manticore Search:
sudo yum install manticore manticore-extra
If you are upgrading to Manticore 6 from an older version, it is recommended to remove your old packages first to avoid conflicts caused by the updated package structure:
sudo yum remove manticore*
It won't remove your data or configuration file.
If you prefer "Nightly" (development) versions do:
sudo yum -y install https://repo.manticoresearch.com/manticore-repo.noarch.rpm && \
sudo yum -y --enablerepo manticore-dev install manticore manticore-extra manticore-common manticore-server manticore-server-core manticore-tools manticore-executor manticore-buddy manticore-backup manticore-columnar-lib manticore-server-core-debuginfo manticore-tools-debuginfo manticore-columnar-lib-debuginfo manticore-icudata manticore-galera manticore-galera-debuginfo
To download standalone RPM files from the Manticore repository, follow the instructions available at https://manticoresearch.com/install/.
If you plan to use indexer to create tables from external sources, you'll need to make sure the corresponding client libraries are installed so that the data sources you want to index are available. The line below installs all of them at once; feel free to use it as is, or reduce it to install only the libraries you need (for MySQL-only sources, just mysql-libs should be enough, and unixODBC is not necessary).
sudo yum install mysql-libs postgresql-libs expat unixODBC
In CentOS Stream 8 you may need to run:
dnf install mariadb-connector-c
if you get the error sql_connect: MySQL source wasn't initialized. Wrong name in dlopen? when trying to build a plain table from MySQL.
The lemmatizer requires Python 3.9+. Make sure you have it installed and that it's configured with --enable-shared.
Here's how to install Python 3.9 and the Ukrainian lemmatizer in CentOS 7/8:
# install Manticore Search and UK lemmatizer from YUM repository
yum -y install https://repo.manticoresearch.com/manticore-repo.noarch.rpm
yum -y install manticore manticore-lemmatizer-uk
# install packages needed for building Python
yum groupinstall "Development Tools" -y
yum install openssl-devel libffi-devel bzip2-devel wget -y
# download, build and install Python 3.9
cd ~
wget https://www.python.org/ftp/python/3.9.2/Python-3.9.2.tgz
tar xvf Python-3.9.2.tgz
cd Python-3.9*/
./configure --enable-optimizations --enable-shared
make -j8 altinstall
# update linker cache
ldconfig
# install pymorphy2 and UK dictionary
pip3.9 install pymorphy2[fast]
pip3.9 install pymorphy2-dicts-uk
Supported releases include:
- Debian 12.0 (Bookworm)
- Ubuntu 22.04 (Jammy)
- Mint
The easiest way to install Manticore in Ubuntu/Debian/Mint is by using our APT repository.
Install the repository:
wget https://repo.manticoresearch.com/manticore-repo.noarch.deb
sudo dpkg -i manticore-repo.noarch.deb
sudo apt update
(install wget if it's not installed; install gnupg2 if apt-key fails).
Then install Manticore Search:
sudo apt install manticore manticore-extra
If you are upgrading to Manticore 6 from an older version, it is recommended to remove your old packages first to avoid conflicts caused by the updated package structure:
sudo apt remove manticore*
It won't remove your data or configuration file.
If you prefer "Nightly" (development) versions do:
wget https://repo.manticoresearch.com/manticore-dev-repo.noarch.deb && \
sudo dpkg -i manticore-dev-repo.noarch.deb && \
sudo apt -y update && \
sudo apt -y install manticore manticore-extra manticore-common manticore-server manticore-server-core manticore-tools manticore-executor manticore-buddy manticore-backup manticore-columnar-lib manticore-server-core-dbgsym manticore-tools-dbgsym manticore-columnar-lib-dbgsym manticore-icudata-65l manticore-galera manticore-galera-dbgsym
To download standalone DEB files from the Manticore repository, follow the instructions available at https://manticoresearch.com/install/.
The Manticore package depends only on zlib and ssl libraries; nothing else is strictly required. However, if you plan to use indexer to create tables from external storages, you'll need to install the appropriate client libraries. To find out which specific libraries indexer requires, run it and look at the top of its output:
$ sudo -u manticore indexer
Manticore 3.5.4 13f8d08d@201211 release
Copyright (c) 2001-2016, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Copyright (c) 2017-2024, Manticore Software LTD (https://manticoresearch.com)
Built by gcc/clang v 5.4.0,
Built on Linux runner-0277ea0f-project-3858465-concurrent-0 4.19.78-coreos #1 SMP Mon Oct 14 22:56:39 -00 2019 x86_64 x86_64 x86_64 GNU/Linux
Configured by CMake with these definitions: -DCMAKE_BUILD_TYPE=RelWithDebInfo -DDISTR_BUILD=xenial -DUSE_SSL=ON -DDL_UNIXODBC=1 -DUNIXODBC_LIB=libodbc.so.2 -DDL_EXPAT=1 -DEXPAT_LIB=libexpat.so.1 -DUSE_LIBICONV=1 -DDL_MYSQL=1 -DMYSQL_LIB=libmysqlclient.so.20 -DDL_PGSQL=1 -DPGSQL_LIB=libpq.so.5 -DLOCALDATADIR=/var/data -DFULL_SHARE_DIR=/usr/share/manticore -DUSE_ICU=1 -DUSE_BISON=ON -DUSE_FLEX=ON -DUSE_SYSLOG=1 -DWITH_EXPAT=1 -DWITH_ICONV=ON -DWITH_MYSQL=1 -DWITH_ODBC=ON -DWITH_POSTGRESQL=1 -DWITH_RE2=1 -DWITH_STEMMER=1 -DWITH_ZLIB=ON -DGALERA_SOVERSION=31 -DSYSCONFDIR=/etc/manticoresearch
Here you can see mentions of libodbc.so.2, libexpat.so.1, libmysqlclient.so.20, and libpq.so.5.
Below is a reference table with a list of all the client libraries for different Debian/Ubuntu versions:
| Distr | MySQL | PostgreSQL | XMLpipe | UnixODBC |
|---|---|---|---|---|
| Ubuntu Trusty | libmysqlclient.so.18 | libpq.so.5 | libexpat.so.1 | libodbc.so.1 |
| Ubuntu Bionic | libmysqlclient.so.20 | libpq.so.5 | libexpat.so.1 | libodbc.so.2 |
| Ubuntu Focal | libmysqlclient.so.21 | libpq.so.5 | libexpat.so.1 | libodbc.so.2 |
| Ubuntu Hirsute | libmysqlclient.so.21 | libpq.so.5 | libexpat.so.1 | libodbc.so.2 |
| Ubuntu Jammy | libmysqlclient.so.21 | libpq.so.5 | libexpat.so.1 | libodbc.so.2 |
| Debian Jessie | libmysqlclient.so.18 | libpq.so.5 | libexpat.so.1 | libodbc.so.2 |
| Debian Buster | libmariadb.so.3 | libpq.so.5 | libexpat.so.1 | libodbc.so.2 |
| Debian Bullseye | libmariadb.so.3 | libpq.so.5 | libexpat.so.1 | libodbc.so.2 |
| Debian Bookworm | libmariadb.so.3 | libpq.so.5 | libexpat.so.1 | libodbc.so.2 |
To find packages that provide the libraries, you can use, for example, apt-file:
apt-file find libmysqlclient.so.20
libmysqlclient20: /usr/lib/x86_64-linux-gnu/libmysqlclient.so.20
libmysqlclient20: /usr/lib/x86_64-linux-gnu/libmysqlclient.so.20.2.0
libmysqlclient20: /usr/lib/x86_64-linux-gnu/libmysqlclient.so.20.3.6
Note that you only need the libraries for the types of sources you're going to use. So if you plan to build tables only from MySQL, you might need to install only the MySQL library (in the above case libmysqlclient20).
Finally, install the needed packages:
sudo apt-get install libmysqlclient20 libodbc1 libpq5 libexpat1
If you aren't going to use the indexer tool at all, you don't need to find and install any libraries.
To support CJK tokenization, the official packages contain binaries with an embedded ICU library and include the ICU data file. They are independent of any ICU runtime library that might be available on your system and cannot be upgraded.
The lemmatizer requires Python 3.9+. Make sure you have it installed and that it's configured with --enable-shared.
Here's how to install Python 3.9 and the Ukrainian lemmatizer on Debian and Ubuntu:
# install Manticore Search and UK lemmatizer from APT repository
cd ~
wget https://repo.manticoresearch.com/manticore-repo.noarch.deb
sudo dpkg -i manticore-repo.noarch.deb
sudo apt -y update
sudo apt -y install manticore manticore-lemmatizer-uk
# install packages needed for building Python
sudo apt -y update
sudo apt -y install wget build-essential libreadline-dev libncursesw5-dev libssl-dev libsqlite3-dev tk-dev libgdbm-dev libc6-dev libbz2-dev libffi-dev zlib1g-dev
# download, build and install Python 3.9
cd ~
wget https://www.python.org/ftp/python/3.9.4/Python-3.9.4.tgz
tar xzf Python-3.9.4.tgz
cd Python-3.9.4
./configure --enable-optimizations --enable-shared
sudo make -j8 altinstall
# update linker cache
sudo ldconfig
# install pymorphy2 and UK dictionary
sudo pip3.9 install pymorphy2[fast]
sudo pip3.9 install pymorphy2-dicts-uk
brew install manticoresoftware/tap/manticoresearch manticoresoftware/tap/manticore-extra
Start Manticore as a brew service:
brew services start manticoresearch
The default configuration file for Manticore is located at either /usr/local/etc/manticoresearch/manticore.conf or /opt/homebrew/etc/manticoresearch/manticore.conf.
If you plan to use indexer to fetch data from sources such as MySQL, PostgreSQL, or another database using ODBC, you may need additional libraries, such as mysql@5.7, libpq, and unixodbc, respectively.
If you prefer "Nightly" (development) versions do:
brew tap manticoresoftware/tap-dev
brew install manticoresoftware/tap-dev/manticoresearch-dev manticoresoftware/tap-dev/manticore-extra-dev
brew services start manticoresearch-dev
The bundled manticore.conf file works in RT mode, so no additional configuration is required. To install searchd (the Manticore Search server) as a Windows service, run:
\path\to\searchd.exe --install --config \path\to\config --servicename Manticore
Make sure to use the full path of the configuration file, otherwise searchd.exe will not be able to locate it when it starts as a service.
After installation, the service can be started from the Services snap-in of the Microsoft Management Console.
Once started, you can access Manticore using the MySQL command line interface:
mysql -P9306 -h127.0.0.1
Note that in most examples in this manual, we use -h0 to connect to the local host, but in Windows, you must use localhost or 127.0.0.1 explicitly.
Compiling Manticore Search from sources enables custom build configurations, such as disabling certain features or adding new patches for testing. For example, you may want to compile from sources and disable the embedded ICU in order to use a different version installed on your system that can be upgraded independently of Manticore. This is also useful if you are interested in contributing to the Manticore Search project.
To prepare official release and development packages, we use Docker and a special building image. This image includes essential tooling and is designed to be used with external sysroots, so one container can build packages for all operating systems. You can build the image using the Dockerfile and README or use an image from Docker Hub. This is the easiest way to create binaries for any supported operating system and architecture. You'll also need to specify the following environment variables when running the container:
- DISTR: the target platform
- arch: the architecture
- SYSROOT_URL: the URL to the system roots archives. You can use https://repo.manticoresearch.com/repository/sysroots unless you are building the sysroots yourself (instructions can be found here).

To find possible values for DISTR and arch, you can use the directory https://repo.manticoresearch.com/repository/sysroots/roots_with_zstd/ as a reference, as it includes sysroots for all supported combinations.
After that, building packages inside the Docker container is as easy as calling:
cmake -DPACK=1 /path/to/sources
cmake --build .
For instance, to create a package for RHEL7-compatible operating systems that is similar to the official version Manticore Core Team provides, you should execute the following commands in the directory containing the Manticore Search sources. This directory is the root of a cloned repository from https://github.com/manticoresoftware/manticoresearch:
docker run -it --rm -e SYSROOT_URL=https://repo.manticoresearch.com/repository/sysroots \
-e arch=x86_64 \
-e DISTR=rhel7 \
-e boost=boost_rhel_feb17 \
-e sysroot=roots_nov22 \
-v $(pwd):/manticore_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa \
manticoresearch/external_toolchain:clang16_cmake3263 bash
# following is to be run inside docker shell
cd /manticore_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/
mkdir build && cd build
cmake -DPACK=1 ..
cmake --build .
# or if you want to build packages:
# cmake --build . --target package
The long source directory path is required; without it, building the sources may fail.
The same process can be used to build binaries/packages not only for popular Linux distributions, but also for FreeBSD, Windows, and macOS.
Compiling Manticore without using the building Docker is not recommended, but if you need to do it, here's what you may need to know:
- On macOS, you need Clang from the Xcode command line tools (run xcode-select --install to install).

Manticore source code is hosted on GitHub.
To obtain the source code, clone the repository and then check out the desired branch or tag. The branch master represents the main development branch. Upon release, a versioned tag is created, such as 3.6.0, and a new branch for the current release is started, in this case manticore-3.6.0. The head of the versioned branch after all changes is used as the source to build all binary releases. For example, to take the sources of version 3.6.0, you can run:
git clone https://github.com/manticoresoftware/manticoresearch.git
cd manticoresearch
git checkout manticore-3.6.0
You can download the desired code from GitHub by using the "Download ZIP" button. Both .zip and .tar.gz formats are suitable.
wget -c https://github.com/manticoresoftware/manticoresearch/archive/refs/tags/3.6.0.tar.gz
tar -zxf 3.6.0.tar.gz
cd manticoresearch-3.6.0
Manticore uses CMake. Assuming you are inside the root directory of the cloned repository:
mkdir build && cd build
cmake ..
CMake will investigate available features and configure the build according to them. By default, all features are considered enabled if they are available. The script also downloads and builds some external libraries, assuming that you want to use them. Implicitly, you get support for the maximal number of features.
You can also configure the build explicitly with flags and options. To enable feature FOO add -DFOO=1 to the CMake call.
To disable it, use -DFOO=0. If not explicitly noted, enabling a feature that is not available (such as WITH_GALERA on an MS Windows build) will cause the configuration to fail with an error. Disabling a feature, apart from excluding it from the build, also disables its investigation on the system and disables the downloading/building of any related external libraries.
- USE_SYSLOG - allows using syslog in query logging.
- WITH_RE2 / WITH_RE2_FORCE_STATIC - build with the RE2 regular expression library; the "force static" variant links it statically so that the final binaries do not depend on a shared RE2 library in your system.
- WITH_STEMMER / WITH_STEMMER_FORCE_STATIC - build with the Snowball stemming library; the "force static" variant links it statically so that the final binaries do not depend on a shared libstemmer library in your system.
- WITH_ICU / WITH_ICU_FORCE_STATIC - build with the ICU library, used for Chinese segmentation when morphology icu_chinese is in use. The statically linked variant also includes the ICU data file in the installation/distribution. The purpose of a statically linked ICU is to have a library of a known version, so that behavior is determined and not dependent on any system libraries. You will most likely prefer to use the system ICU instead, as it may be updated over time without the need to recompile the Manticore daemon; in this case, you need to explicitly disable this option. This will also save you some space occupied by the ICU data file (about 30M), as it will not be included in the distribution.
- WITH_ODBC / DL_ODBC / ODBC_LIB - build with ODBC source support in indexer; indexing MSSQL also implies this flag. With dynamic loading, the ODBC library is loaded at runtime only when an ODBC source is actually processed; you can override the library name by setting ODBC_LIB with the proper path to an alternative library before running the indexer.
- WITH_EXPAT / DL_EXPAT - build with expat support for xmlpipe sources. Without dynamic loading, a missing expat library would cause any run of the indexer tool to fail, even if you want to process something not related to xmlpipe. This option asks the indexer to load the library at runtime only when you want to deal with an xmlpipe source.
- WITH_ICONV / ICONV_LIB - build with iconv support for different encodings of xmlpipe sources. Without dynamic loading, a missing iconv library would cause any run of the indexer tool to fail, even if you want to process something not related to xmlpipe; with it, the indexer loads the library at runtime only when you want to deal with an xmlpipe source. You can override the library name by setting ICONV_LIB with the proper path to an alternative library before running the indexer.
- WITH_MYSQL / DL_MYSQL / MYSQL_LIB - build with MySQL source support. Without dynamic loading, a missing MySQL client library would cause any run of the indexer tool to fail, even if you want to process something not related to MySQL. This option asks the indexer to load the library at runtime only when you want to deal with a MySQL source. You can override the library name by setting MYSQL_LIB with the proper path to an alternative library before running the indexer.
- WITH_POSTGRESQL / DL_POSTGRESQL / POSTGRESQL_LIB - build with PostgreSQL source support. Without dynamic loading, a missing PostgreSQL client library would cause any run of the indexer tool to fail, even if you want to process something not related to PostgreSQL. This option asks the indexer to load the library at runtime only when you want to deal with a PostgreSQL source. You can override the library name by setting POSTGRESQL_LIB with the proper path to an alternative library before running the indexer.
- LOCALDATADIR - the default place where searchd stores binlogs. If the path is not overridden in the runtime configuration (i.e. manticore.conf, which is not related to this build configuration), binlogs will be placed in this path. It is typically an absolute path; however, it is not required to be, and relative paths can also be used. You probably would not need to change the default value defined by the configuration, which, depending on the target system, might be something like /var/data, /var/lib/manticore/data, or /usr/local/var/lib/manticore/data.
- FULL_SHARE_DIR - the default path to shared files. It can be overridden by the environment variable FULL_SHARE_DIR before starting any tool that utilizes files from that folder. This is an important path, as many things are expected to be found there by default. These include predefined charset tables, stopwords, manticore modules, and ICU data files, all placed in that folder. The configuration script usually determines this path to be something like /usr/share/manticore or /usr/local/share/manticore.
- PACK - takes the DISTR environment variable, assigns it to the DISTR_BUILD parameter, and then works as usual. This is very useful when building in prepared build systems, like Docker containers, where the DISTR variable is set at the system level and reflects the target system for which the container is intended.
- CMAKE_INSTALL_PREFIX - where Manticore expects to be installed. Manticore is not installed during the build; you either run the cmake --install command or create a package and then install it. The prefix can be changed at any time, even during installation, by invoking cmake --install . --prefix /path/to/installation. However, at config time, this variable is used to initialize the default values of LOCALDATADIR and FULL_SHARE_DIR. For example, setting it to /my/custom at configure time will set LOCALDATADIR as /my/custom/var/lib/manticore/data, and FULL_SHARE_DIR as /my/custom/usr/share/manticore.
- LIBS_BUNDLE - a folder with pre-downloaded sources of third-party libraries. For example, if the build needs the stemmer, it will first look for its archive (libstemmer_c.tgz) in this folder. Next time you want to build from scratch, the configuration script will first look in the bundle, and if it finds the stemmer there, it will not download it again from the Internet.

Note that some options are organized in triples: WITH_XXX, DL_XXX, and XXX_LIB - like the support of mysql, odbc, etc. WITH_XXX determines whether the next two have an effect or not. I.e., if you set WITH_ODBC to 0, there is no sense in providing DL_ODBC and ODBC_LIB, and these two will have no effect if the whole feature is disabled. Also, XXX_LIB makes no sense without DL_XXX, because if you don't want the DL_XXX option, dynamic loading will not be used and the name provided by XXX_LIB is useless. That is used by the default introspection.
Also, using the iconv library assumes expat and is useless if the latter is disabled.
Also, some libraries may always be available, so there is no point in avoiding linking with them. For example, on Windows that is ODBC; on macOS that is Expat, iconv, and possibly others. Default introspection detects such libraries and effectively emits only WITH_XXX for them, without DL_XXX and XXX_LIB, which keeps things simpler.
With some options in play, configuration might look like:
mkdir build && cd build
cmake -DWITH_MYSQL=1 -DWITH_RE2=1 ..
Apart from general configuration values, you may also investigate the file CMakeCache.txt, which is left in the build folder right after you run the configuration. Any values defined there may be redefined explicitly when running cmake. For example, you may run cmake -DHAVE_GETADDRINFO_A=FALSE ..., and that configuration run will not assume the investigated value of that variable, but will use the one you've provided.
Environment variables are useful for providing global settings that are stored apart from the build configuration and are always present. For persistence, they may be set globally on the system in different ways - by adding them to the .bashrc file, embedding them into a Dockerfile if you produce a Docker-based build system, or writing them in the system environment variables on Windows. You may also set them short-lived using export VAR=value in the shell. Or even shorter, by prepending values to the cmake call, like CACHEB=/my/cache cmake ... - this way it will only work on that call and will not be visible on the next.
Some of these variables are known to be used in general by cmake and other tools, such as CXX, which determines the current C++ compiler, or CXXFLAGS, which provides compiler flags.
However, we also have some variables that are specific to Manticore configuration and were invented solely for our builds.
- DISTR: sets the default value for the DISTR_BUILD option when -DPACK=1 is used.
- WRITEB: works together with the bundle of library sources. If you run, for example, WRITEB=1 cmake ... and the stemmer's sources are not found in the bundle, they will be downloaded from the vendor's site into the bundle (without WRITEB they are downloaded into a temporary folder inside the build and disappear when you wipe the build folder).

At the end of the configuration, you may see what is available and will be used in a list like this one:
-- Enabled features compiled in:
* Galera, replication of tables
* re2, a regular expression library
* stemmer, stemming library (Snowball)
* icu, International Components for Unicode
* OpenSSL, for encrypted networking
* ZLIB, for compressed data and networking
* ODBC, for indexing MSSQL (windows) and generic ODBC sources with indexer
* EXPAT, for indexing xmlpipe sources with indexer
* Iconv, for support of different encodings when indexing xmlpipe sources with indexer
* MySQL, for indexing MySQL sources with indexer
* PostgreSQL, for indexing PostgreSQL sources with indexer
To build, run:
cmake --build . --config RelWithDebInfo
To install run:
cmake --install . --config RelWithDebInfo
To install into a custom (non-default) folder, run:
cmake --install . --prefix path/to/build --config RelWithDebInfo
For building a package, use the target package. It will build the package according to the selection provided by the -DDISTR_BUILD option. By default, it will be a simple .zip or .tgz archive with all binaries and supplementary files.
cmake --build . --target package --config RelWithDebInfo
If you haven't changed the path for sources and build, simply move to your build folder and run:
cmake .
cmake --build . --clean-first --config RelWithDebInfo
If for any reason it doesn't work, you can delete the CMakeCache.txt file located in the build folder. After this step, you have to run cmake again, pointing to the source folder and configuring the options.
If that doesn't help either, just wipe out your build folder and start from scratch.
In short, just use --config RelWithDebInfo as written above. You can't go wrong with it.
We use two build types. For development, it is Debug - it assigns compiler flags for optimization and other things in a way that is very friendly for development, meaning step-by-step execution when debugging. However, the produced binaries are quite large and slow for production.
For releases, we use another type - RelWithDebInfo - which means 'release build with debug info'. It produces production binaries with embedded debug info. The latter is then split away into separate debuginfo packages, which are stored alongside the release packages and may be used in case of issues like crashes - for investigation and bug fixing. CMake also provides Release and MinSizeRel, but we don't use them. If the build type is not available, CMake will make a noconfig build.
There are two types of generators: single-config and multi-config.
- With a single-config generator, the build type is specified at configuration time via the CMAKE_BUILD_TYPE parameter. If it is not defined, the build will fall back to the RelWithDebInfo type, which is suitable if you just want to build Manticore from sources and not participate in development. For explicit builds, you should provide a build type, like -DCMAKE_BUILD_TYPE=Debug.
- With a multi-config generator, the build type is specified at build time via the --config option; otherwise it will build a kind of noconfig, which is not desirable. So, you should always specify the build type, like --config Debug.

If you want to specify the build type but don't want to care about whether it is a 'single' or 'multi' config generator - just provide the necessary keys in both places. I.e., configure with -DCMAKE_BUILD_TYPE=Debug, and then build with --config Debug. Just be sure that both values are the same. If the target builder is a single-config, it will consume the configuration param. If it is multi-config, the configuration param will be ignored, but the correct build configuration will be selected by the --config key.
If you want RelWithDebInfo (i.e., just build for production) and know you're on a single-config platform (that is, everything except Windows), you can omit the --config flag on the cmake invocation. The default CMAKE_BUILD_TYPE=RelWithDebInfo will then be configured and used. All the commands for 'building', 'installation', and 'building a package' then become shorter.
CMake doesn't perform the build by itself; it generates rules for the local build system. Usually, it determines the available build system well, but sometimes you might need to provide a generator explicitly. You can run cmake -G and review the list of available generators.
cmake -G "Visual Studio 16 2019" ....
```
- On all other platforms - usually Unix Makefiles are used, but you can specify another one, such as Ninja, or Ninja Multi-Config, as:
Multi-Config`, as:
```bash
cmake -GNinja ...
```
or
```bash
cmake -G"Ninja Multi-Config" ...
Ninja Multi-Config is quite useful, as it is truly 'multi-config' and available on Linux/macOS/BSD. With this generator, you can postpone choosing the configuration type until build time, and you can also build several configurations in one and the same build folder, changing only the --config param.
If you want to build RPM packages, the path to the build directory must be long enough - /manticore012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789, for example. That is because RPM tools modify the path inside the compiled binaries when building debug info, and they can only overwrite existing room and won't allocate more. The aforementioned long path has 100 chars, and that is quite enough for such a case.

Some libraries need to be available if you want to use them:
- For indexing (indexer tool): expat, iconv, mysql, odbc, postgresql. Without them, you can only process tsv and csv sources.
- For serving queries (searchd daemon): openssl might be necessary.
- For all (required, mandatory!) we need the Boost library. The minimal version is 1.61.0; however, we build the binaries with a fresher version, 1.75.0. Even more recent versions (like 1.76) should also be okay. On Windows, you can download pre-built Boost from their site (boost.org) and install it into the default suggested path (i.e. C:\\boost...). On macOS, the one provided in brew is okay. On Linux, you can check the available version in the official repositories, and if it doesn't match the requirements, you can build it from sources. We need the component 'context'; you can also build the components 'system' and 'program_options', which will be necessary if you also want to build the Galera library from sources. Look into dist/build_dockers/xxx/boost_175/Dockerfile for a short self-documented script/instruction on how to do it.
On the build system, you need the 'dev' or 'devel' versions of these packages installed (i.e. libmysqlclient-devel, unixodbc-devel, etc.; look at our dockerfiles for the concrete package names).
On run systems, these packages should be present at least in their final (non-dev) variants (devel variants are usually larger, as they include not only the target binaries but also development files such as include headers).
Apart from necessary prerequisites, you might need prebuilt expat, iconv, mysql, and postgresql client libraries. You have to either build them yourself or contact us to get our build bundle (a simple zip archive where the folder with these targets is located).
Run indexer -h. It will show which features were configured and built (whether they're explicit or investigated, doesn't matter):
Built on Linux x86_64 by GNU 8.3.1 compiler.
Configured with these definitions: -DDISTR_BUILD=rhel8 -DUSE_SYSLOG=1 -DWITH_GALERA=1 -DWITH_RE2=1 -DWITH_RE2_FORCE_STATIC=1
-DWITH_STEMMER=1 -DWITH_STEMMER_FORCE_STATIC=1 -DWITH_ICU=1 -DWITH_ICU_FORCE_STATIC=1 -DWITH_SSL=1 -DWITH_ZLIB=1 -DWITH_ODBC=1 -DDL_ODBC=1
-DODBC_LIB=libodbc.so.2 -DWITH_EXPAT=1 -DDL_EXPAT=1 -DEXPAT_LIB=libexpat.so.1 -DWITH_ICONV=1 -DWITH_MYSQL=1 -DDL_MYSQL=1
-DMYSQL_LIB=libmariadb.so.3 -DWITH_POSTGRESQL=1 -DDL_POSTGRESQL=1 -DPOSTGRESQL_LIB=libpq.so.5 -DLOCALDATADIR=/var/lib/manticore/data
-DFULL_SHARE_DIR=/usr/share/manticore
Manticore Search 2.x maintains compatibility with Sphinxsearch 2.x and can load existing tables created by Sphinxsearch. In most cases, upgrading is just a matter of replacing the binaries.
Instead of sphinx.conf (on Linux normally located at /etc/sphinxsearch/sphinx.conf), Manticore by default uses /etc/manticoresearch/manticore.conf. It also runs under a different user and uses different folders.
The systemd service name has changed from sphinx/sphinxsearch to manticore, and the service runs under the user manticore (Sphinx used sphinx or sphinxsearch). It also uses a different folder for the PID file.
The folders used by default are /var/lib/manticore, /var/log/manticore, and /var/run/manticore. You can still use the existing Sphinx config, but you need to manually change the permissions of the /var/lib/sphinxsearch and /var/log/sphinxsearch folders, or simply rename 'sphinx' to 'manticore' globally in the system files. If you use other folders (for data, wordforms files, etc.), their ownership must also be switched to the user manticore. The pid_file location should be changed to /var/run/manticore/searchd.pid to match manticore.service.
If you want to use the Manticore folder instead, the table files need to be moved to the new data folder (/var/lib/manticore) and the permissions must be changed to user manticore.
Upgrading from Sphinx / Manticore 2.x to 3.x is not straightforward, as the table storage engine has undergone a significant upgrade and the new searchd cannot load older tables and upgrade them to the new format on-the-fly.
Manticore Search 3 got a redesigned table storage. Tables created with Manticore/Sphinx 2.x cannot be loaded by Manticore Search 3 without a conversion. Because of the 4GB limitation, a real-time table in 2.x could still have several disk chunks after an optimize operation. After upgrading to 3.x, these tables can be optimized to a single disk chunk with the usual OPTIMIZE command. Index files have also changed. The only component that didn't get any structural changes is the .spp file (hitlists). .sps (strings/json) and .spm (MVA) are now held by .spb (var-length attributes). The new format still has an .spm file, but it's used for the row map (previously it was dedicated to MVA attributes). The newly added extensions are .spt (docid lookup), .sphi (secondary index histograms), and .spds (document storage). If you use scripts that manipulate table files, they should be adapted for the new file extensions.
The upgrade procedure may differ depending on your setup (number of servers in the cluster, whether you have high availability or not, etc.), but in general, it involves creating new 3.x table versions and replacing your existing ones, as well as replacing older 2.x binaries with the new ones.
There are two special requirements to take care of:
Manticore Search 3 includes a new tool - index_converter - that can convert Sphinx 2.x / Manticore 2.x tables to the 3.x format. index_converter comes in a separate package, which should be installed first. Use the conversion tool to create 3.x versions of your tables. index_converter can write the new files into the existing data folder and back up the old files, or it can write the new files to a chosen folder.
If you have a single server, convert your tables with index_converter (writing them to a separate folder with the --output-dir option if you want to keep the originals untouched), then replace the 2.x binaries with the 3.x ones.

To minimize downtime, you can copy the 2.x tables, config (you'll need to edit paths here for tables, logs, and different ports), and binaries to a separate location and start this copy on a separate port. Point your application to it. After the upgrade to 3.x is done and the new server is started, you can point the application back to the normal ports. If everything is good, stop the 2.x copy and delete the files to free up space.
If you have a spare box (like a testing or staging server), you can do the table upgrade there first and even install Manticore 3 to perform several tests. If everything is okay, copy the new table files to the production server. If you have multiple servers that can be pulled out of production, do it one by one and perform the upgrade on each. For distributed setups, 2.x searchd can work as a master with 3.x nodes, so you can do the upgrading on the data nodes first, and then on the master node.
There have been no changes made to the way clients should connect to the engine, or any changes to the querying mode or behavior of queries.
Kill-lists have been redesigned in Manticore Search 3. In previous versions, kill-lists were applied to the result set provided by each previously searched table at query time.
Thus, in 2.x, the table order at query time mattered. For example, if a delta table had a kill-list, in order to apply it against the main table, the order had to be main, delta (either in a distributed table or in the FROM clause).
In Manticore 3, kill-lists are applied to a table when it's loaded during searchd startup or gets rotated. The new directive killlist_target in table configuration specifies target tables and defines which doc ids from the source table should be used for suppression. These can be ids from the defined kill-list, actual doc ids of the table or both.
Documents from the kill-lists are deleted from the target tables; they are not returned in results even if the search doesn't include the table that provided the kill-list. Because of that, the order of tables for searching no longer matters. Now, delta, main and main, delta will provide the same results.
In previous versions, tables were rotated following the order from the configuration file. In Manticore 3, the table rotation order is much smarter and works in accordance with kill-list targets. Before starting to rotate tables, the server looks for chains of tables by killlist_target definitions. It will first rotate tables not referenced anywhere as kill-list targets. Next, it will rotate tables targeted by the already rotated tables, and so on. For example, if we run indexer --all and we have 3 tables: main, delta_big (which targets main), and delta_small (which targets delta_big), then delta_small is rotated first, then delta_big, and finally main. This ensures that when a dependent table is rotated, it gets the most up-to-date kill-list from the other tables.
The following configuration directives have been removed:
- docinfo - everything is now extern
- inplace_docinfo_gap - not needed anymore
- mva_updates_pool - MVAs no longer have a dedicated pool for updates, as they can now be updated directly in the blob (see below)

String, JSON, and MVA attributes can be updated in Manticore 3.x using the UPDATE statement.
In 2.x, string attributes required REPLACE; for JSON, it was only possible to update scalar properties (as they were fixed-width), and MVAs could be updated using the MVA pool. Now updates are performed directly on the blob component. One setting that may require tuning is attr_update_reserve, which allows changing the extra space allocated at the end of the blob, used to avoid frequent resizes in case the new values are bigger than the existing values in the blob.
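A short sketch of such updates (hypothetical table and values):

```sql
CREATE TABLE items (title text, brand string, props json, tags multi);
INSERT INTO items (id, title, brand, props, tags)
VALUES (1, 'first item', 'acme', '{"color":"red"}', (10,20));
-- string, JSON, and MVA attributes can now be updated in place
UPDATE items SET brand='globex', props='{"color":"blue"}', tags=(30,40) WHERE id=1;
```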
Doc ids used to be UNSIGNED 64-bit integers. Now they are POSITIVE SIGNED 64-bit integers.
Read here about the RT mode
Manticore 3.x recognizes and parses special suffixes, which make it easier to use numeric values with a special meaning. Their common form is an integer number plus a literal, like 10k or 100d, but not 40.3s (since 40.3 is not an integer) and not 2d 4h (since there are two values, not one). Literals are case-insensitive, so 10W is the same as 10w. There are 2 types of such suffixes currently supported:
- Size suffixes: k for kilobytes (1k=1024), m for megabytes (1m=1024k), g for gigabytes (1g=1024m), and t for terabytes (1t=1024g).
- Time suffixes: us for useconds (microseconds), ms for milliseconds, s for seconds, m for minutes, h for hours, d for days, and w for weeks.

index_converter is a tool for converting tables created with Sphinx/Manticore Search 2.x to the Manticore Search 3.x table format. The tool can be used in several different ways:
$ index_converter --config /home/myuser/manticore.conf --index tablename
$ index_converter --config /home/myuser/manticore.conf --all
$ index_converter --path /var/lib/manticoresearch/data --all
The new version of the table is written by default in the same folder. The previous version's files are saved with the .old extension in their name. An exception is the .spp (hitlists) file, which is the only table component that didn't have any changes in the new format.
You can save the new table version to a different folder using the --output-dir option:
$ index_converter --config /home/myuser/manticore.conf --all --output-dir /new/path
A special case is for tables containing kill-lists. As the behaviour of how kill-lists works has changed (see killlist_target), the delta table should know which are the target tables for applying the kill-lists. There are 3 ways to have a converted table ready for setting targeted tables for applying kill-lists:
Use --killlist-target when converting a table:
$ index_converter --config /home/myuser/manticore.conf --index deltaindex --killlist-target mainindex:kl
Add killlist_target in the configuration before doing the conversion
Here's the complete list of index_converter options:
- --config <file> (-c <file> for short) tells index_converter to use the given file as its configuration. Normally, it will look for manticore.conf in the installation directory (e.g. /usr/local/manticore/etc/manticore.conf if installed into /usr/local/sphinx), followed by the current directory you are in when calling index_converter from the shell.
- --index specifies which table should be converted
- --path - instead of using a config file, a path containing table(s) can be used
- --strip-path - strips path from filenames referenced by the table: stopwords, exceptions and wordforms
- --large-docid - allows converting documents with ids larger than 2^63 and displays a warning; otherwise it will just exit with an error on the large id. This option was added because in Manticore 3.x doc ids are signed bigint, while previously they were unsigned
- --output-dir <dir> - writes the new files in a chosen folder rather than in the same location as the existing table files. When this option is set, the existing table files will remain untouched at their location.
- --all - converts all tables from the config
- --killlist-target <targets> - sets the target tables for which kill-lists will be applied. This option should be used only in conjunction with the --index option

You can install and start Manticore easily on various operating systems, including Ubuntu, CentOS, Debian, Windows, and macOS. Additionally, you can also use Manticore as a Docker container.
wget https://repo.manticoresearch.com/manticore-repo.noarch.deb
sudo dpkg -i manticore-repo.noarch.deb
sudo apt update
sudo apt install manticore manticore-columnar-lib
sudo systemctl start manticore
wget https://repo.manticoresearch.com/manticore-repo.noarch.deb
sudo dpkg -i manticore-repo.noarch.deb
sudo apt update
sudo apt install manticore manticore-columnar-lib
sudo systemctl start manticore
sudo yum install https://repo.manticoresearch.com/manticore-repo.noarch.rpm
sudo yum install manticore manticore-columnar-lib
sudo systemctl start manticore
brew install manticoresearch
brew services start manticoresearch
docker pull manticoresearch/manticore
docker run -e EXTRA=1 --name manticore -p9306:9306 -p9308:9308 -p9312:9312 -d manticoresearch/manticore
By default, Manticore is waiting for your connections on:
- port 9306 for MySQL clients
- port 9308 for HTTP/HTTPS connections
- port 9312 for connections from other Manticore nodes and clients based on the Manticore binary API
More details about HTTPS support can be found in our learning course here.
mysql -h0 -P9306
curl -s "http://localhost:9308/search"
// https://github.com/manticoresoftware/manticoresearch-php
require_once __DIR__ . '/vendor/autoload.php';
$config = ['host'=>'127.0.0.1','port'=>9308];
$client = new \Manticoresearch\Client($config);
// https://github.com/manticoresoftware/manticoresearch-python
import manticoresearch
config = manticoresearch.Configuration(
host = "http://127.0.0.1:9308"
)
client = manticoresearch.ApiClient(config)
indexApi = manticoresearch.IndexApi(client)
searchApi = manticoresearch.SearchApi(client)
utilsApi = manticoresearch.UtilsApi(client)
// https://github.com/manticoresoftware/manticoresearch-javascript
var Manticoresearch = require('manticoresearch');
var client= new Manticoresearch.ApiClient()
client.basePath="http://127.0.0.1:9308";
indexApi = new Manticoresearch.IndexApi(client);
searchApi = new Manticoresearch.SearchApi(client);
utilsApi = new Manticoresearch.UtilsApi(client);
// https://github.com/manticoresoftware/manticoresearch-java
import com.manticoresearch.client.*;
import com.manticoresearch.client.model.*;
import com.manticoresearch.client.api.*;
...
ApiClient client = Configuration.getDefaultApiClient();
client.setBasePath("http://127.0.0.1:9308");
...
IndexApi indexApi = new IndexApi(client);
SearchApi searchApi = new SearchApi(client);
UtilsApi utilsApi = new UtilsApi(client);
// https://github.com/manticoresoftware/manticoresearch-net
using System.Net.Http;
...
using ManticoreSearch.Client;
using ManticoreSearch.Api;
using ManticoreSearch.Model;
...
config = new Configuration();
config.BasePath = "http://localhost:9308";
httpClient = new HttpClient();
httpClientHandler = new HttpClientHandler();
...
var indexApi = new IndexApi(httpClient, config, httpClientHandler);
var searchApi = new SearchApi(httpClient, config, httpClientHandler);
var utilsApi = new UtilsApi(httpClient, config, httpClientHandler);
import {
Configuration,
IndexApi,
SearchApi,
UtilsApi
} from "manticoresearch-ts";
...
const config = new Configuration({
basePath: 'http://localhost:9308',
})
const indexApi = new IndexApi(config);
const searchApi = new SearchApi(config);
const utilsApi = new UtilsApi(config);
import (
"context"
manticoreclient "github.com/manticoresoftware/manticoresearch-go"
)
...
configuration := manticoreclient.NewConfiguration()
configuration.Servers[0].URL = "http://localhost:9308"
apiClient := manticoreclient.NewAPIClient(configuration)
Let's now create a table called "products" with 2 fields: title, a full-text field that will contain our product's title, and price, of type float.
Note that it is possible to omit creating a table with an explicit create statement. For more information, see Auto schema.
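As a quick sketch of that behavior (assuming auto schema is enabled; the table name is made up):

```sql
-- inserting into a non-existent table creates it on the fly,
-- with field and attribute types inferred from the values
INSERT INTO my_new_table (title, price) VALUES ('example product', 9.99);
DESCRIBE my_new_table;
```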
More information about different ways to create a table can be found in our learning courses:
create table products(title text, price float) morphology='stem_en';
Query OK, 0 rows affected (0.02 sec)
POST /cli -d "create table products(title text, price float) morphology='stem_en'"
{
"total":0,
"error":"",
"warning":""
}
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float'],
],['morphology' => 'stem_en']);
utilsApi.sql('create table products(title text, price float) morphology=\'stem_en\'')
res = await utilsApi.sql('create table products(title text, price float) morphology=\'stem_en\'');
utilsApi.sql("create table products(title text, price float) morphology='stem_en'");
utilsApi.Sql("create table products(title text, price float) morphology='stem_en'");
res = await utilsApi.sql('create table products(title text, price float) morphology=\'stem_en\'');
res := apiClient.UtilsAPI.Sql(context.Background()).Body("create table products(title text, price float) morphology='stem_en'").Execute();
Let's now add a few documents to the table:
insert into products(title,price) values ('Crossbody Bag with Tassel', 19.85), ('microfiber sheet set', 19.99), ('Pet Hair Remover Glove', 7.99);
Query OK, 3 rows affected (0.01 sec)
POST /insert
{
"index":"products",
"doc":
{
"title" : "Crossbody Bag with Tassel",
"price" : 19.85
}
}
POST /insert
{
"index":"products",
"doc":
{
"title" : "microfiber sheet set",
"price" : 19.99
}
}
POST /insert
{
"index":"products",
"doc":
{
"title" : "Pet Hair Remover Glove",
"price" : 7.99
}
}
{
"_index": "products",
"_id": 0,
"created": true,
"result": "created",
"status": 201
}
{
"_index": "products",
"_id": 0,
"created": true,
"result": "created",
"status": 201
}
{
"_index": "products",
"_id": 0,
"created": true,
"result": "created",
"status": 201
}
$index->addDocuments([
['title' => 'Crossbody Bag with Tassel', 'price' => 19.85],
['title' => 'microfiber sheet set', 'price' => 19.99],
['title' => 'Pet Hair Remover Glove', 'price' => 7.99]
]);
indexApi.insert({"index" : "products", "doc" : {"title" : "Crossbody Bag with Tassel", "price" : 19.85}})
indexApi.insert({"index" : "products", "doc" : {"title" : "microfiber sheet set", "price" : 19.99}})
indexApi.insert({"index" : "products", "doc" : {"title" : "Pet Hair Remover Glove", "price" : 7.99}})
res = await indexApi.insert({"index" : "products", "doc" : {"title" : "Crossbody Bag with Tassel", "price" : 19.85}});
res = await indexApi.insert({"index" : "products", "doc" : {"title" : "microfiber sheet set", "price" : 19.99}});
res = await indexApi.insert({"index" : "products", "doc" : {"title" : "Pet Hair Remover Glove", "price" : 7.99}});
InsertDocumentRequest newdoc = new InsertDocumentRequest();
HashMap<String,Object> doc = new HashMap<String,Object>(){{
put("title","Crossbody Bag with Tassel");
put("price",19.85);
}};
newdoc.index("products").setDoc(doc);
sqlresult = indexApi.insert(newdoc);
newdoc = new InsertDocumentRequest();
doc = new HashMap<String,Object>(){{
put("title","microfiber sheet set");
put("price",19.99);
}};
newdoc.index("products").setDoc(doc);
sqlresult = indexApi.insert(newdoc);
newdoc = new InsertDocumentRequest();
doc = new HashMap<String,Object>(){{
put("title","Pet Hair Remover Glove");
put("price",7.99);
}};
newdoc.index("products").setDoc(doc);
indexApi.insert(newdoc);
Dictionary<string, Object> doc = new Dictionary<string, Object>();
doc.Add("title","Crossbody Bag with Tassel");
doc.Add("price",19.85);
InsertDocumentRequest insertDocumentRequest = new InsertDocumentRequest(index: "products", doc: doc);
sqlresult = indexApi.Insert(insertDocumentRequest);
doc = new Dictionary<string, Object>();
doc.Add("title","microfiber sheet set");
doc.Add("price",19.99);
insertDocumentRequest = new InsertDocumentRequest(index: "products", doc: doc);
sqlresult = indexApi.Insert(insertDocumentRequest);
doc = new Dictionary<string, Object>();
doc.Add("title","Pet Hair Remover Glove");
doc.Add("price",7.99);
insertDocumentRequest = new InsertDocumentRequest(index: "products", doc: doc);
sqlresult = indexApi.Insert(insertDocumentRequest);
res = await indexApi.insert({
index: 'test',
id: 1,
doc: { content: 'Text 1', name: 'Doc 1', cat: 1 },
});
res = await indexApi.insert({
index: 'test',
id: 2,
doc: { content: 'Text 2', name: 'Doc 2', cat: 2 },
});
res = await indexApi.insert({
index: 'test',
id: 3,
doc: { content: 'Text 3', name: 'Doc 3', cat: 7 },
});
indexDoc := map[string]interface{} {"content": "Text 1", "name": "Doc 1", "cat": 1 }
indexReq := manticoreclient.NewInsertDocumentRequest("products", indexDoc)
indexReq.SetId(1)
apiClient.IndexAPI.Insert(context.Background()).InsertDocumentRequest(*indexReq).Execute()
indexDoc = map[string]interface{} {"content": "Text 2", "name": "Doc 3", "cat": 2 }
indexReq = manticoreclient.NewInsertDocumentRequest("products", indexDoc)
indexReq.SetId(2)
apiClient.IndexAPI.Insert(context.Background()).InsertDocumentRequest(*indexReq).Execute()
indexDoc = map[string]interface{} {"content": "Text 3", "name": "Doc 3", "cat": 7 }
indexReq = manticoreclient.NewInsertDocumentRequest("products", indexDoc)
indexReq.SetId(3)
apiClient.IndexAPI.Insert(context.Background()).InsertDocumentRequest(*indexReq).Execute()
More details on the subject can be found here:
Let's find one of the documents. The query we will use is 'remove hair'. As you can see, it finds a document with the title 'Pet Hair Remover Glove' and highlights 'Hair remover' in it, even though the query has "remove", not "remover". This is because when we created the table, we turned on using English stemming (morphology "stem_en").
select id, highlight(), price from products where match('remove hair');
+---------------------+-------------------------------+----------+
| id | highlight() | price |
+---------------------+-------------------------------+----------+
| 1513686608316989452 | Pet <b>Hair Remover</b> Glove | 7.990000 |
+---------------------+-------------------------------+----------+
1 row in set (0.00 sec)
POST /search
{
"index": "products",
"query": { "match": { "title": "remove hair" } },
"highlight":
{
"fields": ["title"]
}
}
{
"took": 0,
"timed_out": false,
"hits": {
"total": 1,
"hits": [
{
"_id": "1513686608316989452",
"_score": 1680,
"_source": {
"price": 7.99,
"title": "Pet Hair Remover Glove"
},
"highlight": {
"title": [
"Pet <b>Hair Remover</b> Glove"
]
}
}
]
}
}
$result = $index->search('@title remove hair')->highlight(['title'])->get();
foreach($result as $doc)
{
echo "Doc ID: ".$doc->getId()."\n";
echo "Doc Score: ".$doc->getScore()."\n";
echo "Document fields:\n";
print_r($doc->getData());
echo "Highlights: \n";
print_r($doc->getHighlight());
}
Doc ID: 1513686608316989452
Doc Score: 1680
Document fields:
Array
(
[price] => 7.99
[title] => Pet Hair Remover Glove
)
Highlights:
Array
(
[title] => Array
(
[0] => Pet <b>Hair Remover</b> Glove
)
)
searchApi.search({"index":"products","query":{"query_string":"@title remove hair"},"highlight":{"fields":["title"]}})
{'hits': {'hits': [{u'_id': u'1513686608316989452',
u'_score': 1680,
u'_source': {u'title': u'Pet Hair Remover Glove', u'price':7.99},
u'highlight':{u'title':[u'Pet <b>Hair Remover</b> Glove']}}],
'total': 1},
'profile': None,
'timed_out': False,
'took': 0}
res = await searchApi.search({"index":"products","query":{"query_string":"@title remove hair"}"highlight":{"fields":["title"]}});
{"hits": {"hits": [{"_id": "1513686608316989452",
"_score": 1680,
"_source": {"title": "Pet Hair Remover Glove", "price":7.99},
"highlight":{"title":["Pet <b>Hair Remover</b> Glove"]}}],
"total": 1},
"profile": None,
"timed_out": False,
"took": 0}
query = new HashMap<String,Object>();
query.put("query_string","@title remove hair");
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
searchRequest.setQuery(query);
HashMap<String,Object> highlight = new HashMap<String,Object>(){{
put("fields",new String[] {"title"});
}};
searchRequest.setHighlight(highlight);
searchResponse = searchApi.search(searchRequest);
class SearchResponse {
took: 84
timedOut: false
hits: class SearchResponseHits {
total: 1
maxScore: null
hits: [{_id=1513686608316989452, _score=1, _source={price=7.99, title=Pet Hair Remover Glove}, highlight={title=[Pet <b>Hair Remover</b> Glove]}}]
aggregations: null
}
profile: null
}
object query = new { query_string="@title remove hair" };
var searchRequest = new SearchRequest("products", query);
var highlight = new Highlight();
highlight.Fieldnames = new List<string> {"title"};
searchRequest.Highlight = highlight;
searchResponse = searchApi.Search(searchRequest);
class SearchResponse {
took: 103
timedOut: false
hits: class SearchResponseHits {
total: 1
maxScore: null
hits: [{_id=1513686608316989452, _score=1, _source={price=7.99, title=Pet Hair Remover Glove}, highlight={title=[Pet <b>Hair Remover</b> Glove]}}]
aggregations: null
}
profile: null
}
res = await searchApi.search({
index: 'test',
query: { query_string: 'text 1' },
highlight: {'fields': ['content'] }
});
{
"hits":
{
"hits":
[{
"_id": "1",
"_score": 1400,
"_source": {"content":"Text 1","name":"Doc 1","cat":1},
"highlight": {"content":["<b>Text 1</b>"]}
}],
"total": 1
},
"profile": None,
"timed_out": False,
"took": 0
}
searchRequest := manticoreclient.NewSearchRequest("test")
query := map[string]interface{} {"query_string": "text 1"};
searchRequest.SetQuery(query);
highlightField := manticoreclient.NewHighlightField("content")
fields := []interface{}{ highlightField }
highlight := manticoreclient.NewHighlight()
highlight.SetFields(fields)
searchRequest.SetHighlight(highlight);
res, _, _ := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()
{
"hits":
{
"hits":
[{
"_id": "1",
"_score": 1400,
"_source": {"content":"Text 1","name":"Doc 1","cat":1},
"highlight": {"content":["<b>Text 1</b>"]}
}],
"total": 1
},
"profile": None,
"timed_out": False,
"took": 0
}
More information on different search options available in Manticore can be found in our learning courses:
Let's assume we now want to update the document - change the price to 18.5. This can be done by filtering on any field, but normally you know the document ID and update based on that.
update products set price=18.5 where id = 1513686608316989452;
Query OK, 1 row affected (0.00 sec)
POST /update
{
"index": "products",
"id": 1513686608316989452,
"doc":
{
"price": 18.5
}
}
{
"_index": "products",
"_id": 1513686608316989452,
"result": "updated"
}
$doc = [
'body' => [
'index' => 'products',
'id' => 1513686608316989452,
'doc' => [
'price' => 18.5
]
]
];
$response = $client->update($doc);
indexApi = api = manticoresearch.IndexApi(client)
indexApi.update({"index" : "products", "id" : 1513686608316989452, "doc" : {"price":18.5}})
res = await indexApi.update({"index" : "products", "id" : 1513686608316989452, "doc" : {"price":18.5}});
UpdateDocumentRequest updateRequest = new UpdateDocumentRequest();
doc = new HashMap<String,Object >(){{
put("price",18.5);
}};
updateRequest.index("products").id(1513686608316989452L).setDoc(doc);
indexApi.update(updateRequest);
Dictionary<string, Object> doc = new Dictionary<string, Object>();
doc.Add("price", 18.5);
UpdateDocumentRequest updateDocumentRequest = new UpdateDocumentRequest(index: "products", id: 1513686608316989452L, doc: doc);
indexApi.Update(updateDocumentRequest);
res = await indexApi.update({ index: "test", id: 1, doc: { cat: 10 } });
updDoc = map[string]interface{} {"cat": 10}
updRequest = manticoreclient.NewUpdateDocumentRequest("test", updDoc)
updRequest.SetId(1)
res, _, _ = apiClient.IndexAPI.Update(context.Background()).UpdateDocumentRequest(*updRequest).Execute()
Let's now delete all documents with price lower than 10.
delete from products where price < 10;
Query OK, 1 row affected (0.00 sec)
POST /delete
{
"index": "products",
"query":
{
"range":
{
"price":
{
"lte": 10
}
}
}
}
{
"_index": "products",
"deleted": 1
}
$result = $index->deleteDocuments(new \Manticoresearch\Query\Range('price',['lte'=>10]));
Array
(
[_index] => products
[deleted] => 1
)
indexApi.delete({"index" : "products", "query": {"range":{"price":{"lte":10}}}})
res = await indexApi.delete({"index" : "products", "query": {"range":{"price":{"lte":10}}}});
DeleteDocumentRequest deleteRequest = new DeleteDocumentRequest();
query = new HashMap<String,Object>();
query.put("range",new HashMap<String,Object>(){{
put("price",new HashMap<String,Object>(){{
put("lte",10);
}});
}});
deleteRequest.index("products").setQuery(query);
indexApi.delete(deleteRequest);
Dictionary<string, Object> price = new Dictionary<string, Object>();
price.Add("lte", 10);
Dictionary<string, Object> range = new Dictionary<string, Object>();
range.Add("price", price);
DeleteDocumentRequest deleteDocumentRequest = new DeleteDocumentRequest(index: "products", range: range);
indexApi.Delete(deleteDocumentRequest);
res = await indexApi.delete({
index: 'test',
query: { match: { '*': 'Text 1' } },
});
delRequest := manticoreclient.NewDeleteDocumentRequest("test")
matchExpr := map[string]interface{} {"*": "Text 1"}
delQuery := map[string]interface{} {"match": matchExpr }
delRequest.SetQuery(delQuery)
res, _, _ := apiClient.IndexAPI.Delete(context.Background()).DeleteDocumentRequest(*delRequest).Execute();
Manticore Search server can be started using different methods, depending on the installation type.
After the installation the Manticore Search service is not started automatically. To start Manticore run the following command:
sudo systemctl start manticore
To stop Manticore run the following command:
sudo systemctl stop manticore
The Manticore service is set to run at boot. You can check it by running:
sudo systemctl is-enabled manticore
If you want to disable Manticore from starting at boot time, run:
sudo systemctl disable manticore
To make Manticore start at boot, run:
sudo systemctl enable manticore
The searchd process logs startup information to the systemd journal. If systemd logging is enabled, you can view the logged information with the following command:
sudo journalctl -u manticore
systemctl set-environment _ADDITIONAL_SEARCHD_PARAMS allows you to specify custom startup flags that the Manticore Search daemon should be started with. See full list here.
For example, to start Manticore with the debug logging level, you can run:
systemctl set-environment _ADDITIONAL_SEARCHD_PARAMS='--logdebug'
systemctl restart manticore
To undo it, run:
systemctl set-environment _ADDITIONAL_SEARCHD_PARAMS=''
systemctl restart manticore
Note, systemd environment variables get reset on server reboot.
Manticore can be started and stopped using service commands:
sudo service manticore start
sudo service manticore stop
To enable the sysV service at boot on RedHat systems run:
chkconfig manticore on
To enable the sysV service at boot on Debian systems (including Ubuntu) run:
update-rc.d manticore defaults
Please note that searchd is started by the init system under the manticore user and all files created by the server will be owned by this user. If searchd is started under, for example, the root user, the file permissions will be changed, which may result in issues when running searchd as a service again.
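If file ownership has already been changed this way, a simple sketch of restoring it (assuming the default data directory and that the group is also named manticore) is:
chown -R manticore:manticore /var/lib/manticore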
You can also start Manticore Search by calling searchd (Manticore Search server binary) directly:
searchd [OPTIONS]
Note that without specifying a path to the configuration file, searchd will try to find it in several locations depending on the operating system.
The options available to searchd in all operating systems are:
--help (-h for short) lists all of the parameters that can be used in your particular build of searchd.
--version (-v for short) shows Manticore Search version information.
--config <file> (-c <file> for short) tells searchd to use the specified file as its configuration.
--stop is used to asynchronously stop searchd, using the details of the PID file as specified in the Manticore configuration file. Therefore, you may also need to confirm to searchd which configuration file to use with the --config option. Example:
$ searchd --config /etc/manticoresearch/manticore.conf --stop
--stopwait is used to synchronously stop searchd. --stop essentially tells the running instance to exit (by sending it a SIGTERM) and then immediately returns. --stopwait will also attempt to wait until the running searchd instance actually finishes the shutdown (e.g. saves all the pending attribute changes) and exits. Example:
$ searchd --config /etc/manticoresearch/manticore.conf --stopwait
Possible exit codes are as follows:
3 if server crashed during shutdown
The --status command is used to query the running searchd instance status using the connection details from the (optionally) provided configuration file. It will try to connect to the running instance using the first found UNIX socket or TCP port from the configuration file. On success, it will query for a number of status and performance counter values and print them. You can also use the SHOW STATUS command to access the very same counters via the SQL protocol. Examples:
$ searchd --status
$ searchd --config /etc/manticoresearch/manticore.conf --status
--pidfile is used to explicitly force using a PID file (where the searchd process identification number is stored) despite any other debugging options that say otherwise (for instance, --console). This is a debugging option.
$ searchd --console --pidfile
--console is used to force searchd into console mode. Typically, Manticore runs as a conventional server application and logs information into log files (as specified in the configuration file). However, when debugging issues in the configuration or the server itself, or trying to diagnose hard-to-track-down problems, it may be easier to force it to dump information directly to the console/command line from which it is being called. Running in console mode also means that the process will not be forked (so searches are done in sequence) and logs will not be written to. (It should be noted that console mode is not the intended method for running searchd.) You can invoke it as:
$ searchd --config /etc/manticoresearch/manticore.conf --console
--logdebug, --logreplication, --logdebugv, and --logdebugvv options enable additional debug output in the server log. They differ by the logging verboseness level. These are debugging options and should not be normally enabled, as they can pollute the log a lot. They can be used temporarily on request to assist with complicated debugging sessions.
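For example, to temporarily enable the more verbose variant (assuming the default configuration path):
$ searchd --config /etc/manticoresearch/manticore.conf --logdebugv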
--iostats is used in conjunction with the logging options (the query_log must have been activated in manticore.conf) to provide more detailed information on a per-query basis about the input/output operations carried out in the course of that query, with a slight performance hit and slightly bigger logs. The IO statistics don't include information about IO operations for attributes, as these are loaded with mmap. To enable it, you can start searchd as follows:
$ searchd --config /etc/manticoresearch/manticore.conf --iostats
--cpustats is used to provide an actual CPU time report (in addition to wall time) in both the query log file (for every given query) and the status report (aggregated). It depends on the clock_gettime() Linux system call or falls back to a less precise call on certain systems. You might start searchd thus:
$ searchd --config /etc/manticoresearch/manticore.conf --cpustats
--port portnumber (-p for short) is used to specify the port that Manticore should listen on to accept binary protocol requests, usually for debugging purposes. This will usually default to 9312, but sometimes you need to run it on a different port. Specifying it on the command line will override anything specified in the configuration file. The valid range is 0 to 65535, but ports numbered 1024 and below usually require a privileged account in order to run. An example of usage:
$ searchd --port 9313
--listen ( address ":" port | port | path ) [ ":" protocol ] (or -l for short) Works as --port, but allows you to specify not only the port, but the full path, IP address and port, or Unix-domain socket path that searchd will listen on. In other words, you can specify either an IP address (or hostname) and port number, just a port number, or a Unix socket path. If you specify a port number but not the address, searchd will listen on all network interfaces. A Unix path is identified by a leading slash. As the last parameter, you can also specify a protocol handler (listener) to be used for connections on this socket. Supported protocol values are 'sphinx' and 'mysql' (MySQL protocol used since 4.1).
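For example, a hypothetical combination of listeners that binds the binary protocol to all interfaces and the MySQL protocol to a specific address might look like this:
$ searchd --listen 9312 --listen 192.168.0.1:9306:mysql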
--force-preread forbids the server from serving any incoming connection until prereading of table files completes. By default, at startup, the server accepts connections while table files are lazy-loaded into memory. This extends the behavior and makes it wait until the files are loaded.
--index (--table) <table> (or -i (-t) <table> for short) forces this instance of searchd to only serve the specified table. Like --port, above, this is usually for debugging purposes; more long-term changes would generally be applied to the configuration file itself.
--strip-path strips the path names from all the file names referenced from the table (stopwords, wordforms, exceptions, etc). This is useful for picking up tables built on another machine with possibly different path layouts.
--replay-flags=<OPTIONS> switch can be used to specify a list of extra binary log replay options. The supported options are:
accept-desc-timestamp, ignore descending transaction timestamps and replay such transactions anyway (the default behavior is to exit with an error).
ignore-open-errors, ignore missing binlog files (the default behavior is to exit with an error).
ignore-trx-errors, ignore any transaction errors and skip current binlog file (the default behavior is to exit with an error).
ignore-all-errors, ignore any errors described above (the default behavior is to exit with an error).
Example:
$ searchd --replay-flags=accept-desc-timestamp
--coredump is used to enable saving a core file or a minidump of the server on crash. It is disabled by default to speed up server restart on crash. This is useful for debugging purposes.
$ searchd --config /etc/manticoresearch/manticore.conf --coredump
--new-cluster bootstraps a replication cluster and makes the server a reference node with cluster restart protection. On Linux you can also run manticore_new_cluster. It will start Manticore in --new-cluster mode via systemd.
--new-cluster-force bootstraps a replication cluster and makes the server a reference node bypassing cluster restart protection. On Linux you can also run manticore_new_cluster --force. It will start Manticore in --new-cluster-force mode via systemd.
There are some options for searchd that are specific to Windows platforms, concerning handling as a service, and are only available in Windows binaries.
Note that in Windows searchd will default to --console mode, unless you install it as a service.
--install installs searchd as a service into the Microsoft Management Console (Control Panel / Administrative Tools / Services). Any other parameters specified on the command line alongside --install will also become part of the command line on future starts of the service. For example, as a part of calling searchd, you will likely also need to specify the configuration file with --config, and you would do that as well as specifying --install. Once called, the usual start/stop facilities will become available via the management console, so any methods you could use for starting, stopping and restarting services would also apply to searchd. Example:
C:\WINDOWS\system32> C:\Manticore\bin\searchd.exe --install
--config C:\Manticore\manticore.conf
If you want to have the I/O stats every time you start searchd, you need to specify the option on the same line as the --install command thus:
C:\WINDOWS\system32> C:\Manticore\bin\searchd.exe --install
--config C:\Manticore\manticore.conf --iostats
--delete removes the service from the Microsoft Management Console and other places where services are registered, after previously being installed with --install. Note that this does not uninstall the software or delete the tables. It means the service will not be called from the services system, and will not be started on the machine's next start. If currently running as a service, the current instance will not be terminated (until the next reboot or until --stop). If the service was installed with a custom name (with --servicename), the same name will need to be specified with --servicename when calling to uninstall. Example:
C:\WINDOWS\system32> C:\Manticore\bin\searchd.exe --delete
--servicename <name> applies the given name to searchd when installing or deleting the service, as it would appear in the Management Console; this will default to searchd, but if being deployed on servers where multiple administrators may log in to the system, or a system with multiple searchd instances, a more descriptive name may be applicable. Note that unless combined with --install or --delete, this option does not do anything. Example:
C:\WINDOWS\system32> C:\Manticore\bin\searchd.exe --install
--config C:\Manticore\manticore.conf --servicename ManticoreSearch
--ntservice is an option that is passed by the Microsoft Management Console to searchd to invoke it as a service on Windows platforms. It would not normally be necessary to call this directly; this would normally be called by Windows when the service is started, although if you wanted to call this as a regular service from the command-line (as the complement to --console) you could do so in theory.
--safetrace forces searchd to only use the system's backtrace() call in crash reports. In certain (rare) scenarios, this might be a "safer" way to get that report. This is a debugging option.
--nodetach switch (Linux only) tells searchd not to detach into the background. This will also cause log entries to be printed out to the console. Query processing operates as usual. This is a debugging option and might also be useful when you run Manticore in a Docker container to capture its output.
Manticore utilizes the plugin_dir for storing and using Manticore Buddy plugins. By default, this value is accessible to the "manticore" user in a standard installation. However, if you start the searchd daemon manually with a different user, the daemon might not have access to the plugin_dir. To address this problem, ensure you specify a plugin_dir in the common section that the user running the searchd daemon can write to.
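A minimal sketch of such a configuration (the directory path here is only an example; use any location writable by the user running searchd):
common
{
    plugin_dir = /var/lib/manticore/plugins
}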
searchd supports a number of signals:
SIGTERM - Initiates a clean shutdown. New queries will not be handled, but queries that are already started will not be forcibly interrupted.
SIGHUP - Initiates tables rotation. Depending on the value of the seamless_rotate setting, new queries might be shortly stalled; clients will receive temporary errors.
SIGUSR1 - Forces reopen of the searchd log and query log files, allowing for log file rotation.
MANTICORE_TRACK_DAEMON_SHUTDOWN=1 enables detailed logging while searchd is shutting down. It's useful in case of some shutdown problems, such as when Manticore takes too long to shut down or freezes during the shutdown process.
The image is based on the current release of the Manticore package.
The default configuration includes a sample Real-Time table and listens on the default ports:
9306 for connections from a MySQL client
9308 for connections via HTTP
9312 for connections via a binary protocol (e.g. in case you run a cluster)
The image comes with libraries for easy indexing data from MySQL, PostgreSQL, XML and CSV files.
The below is the simplest way to start Manticore in a container and log in to it via the mysql client:
docker run -e EXTRA=1 --name manticore --rm -d manticoresearch/manticore && echo "Waiting for Manticore docker to start. Consider mapping the data_dir to make it start faster next time" && until docker logs manticore 2>&1 | grep -q "accepting connections"; do sleep 1; echo -n .; done && echo && docker exec -it manticore mysql && docker stop manticore
Note that upon exiting the MySQL client, the Manticore container will be stopped and removed, resulting in no saved data. For information on using Manticore in a production environment, please see below.
The image comes with a sample table that can be loaded like this:
mysql> source /sandbox.sql
Also, the mysql client has several sample queries in its history that you can run on the above table, just use Up/Down keys in the client to see and run them.
For data persistence the folder /var/lib/manticore/ should be mounted to local storage or other desired storage engine.
The configuration file inside the instance is located at /etc/manticoresearch/manticore.conf. For custom settings, this file should be mounted to your own configuration file.
The ports are 9306/9308/9312 for SQL/HTTP/Binary, expose them depending on how you are going to use Manticore. For example:
docker run -e EXTRA=1 --name manticore -v $(pwd)/data:/var/lib/manticore -p 127.0.0.1:9306:9306 -p 127.0.0.1:9308:9308 -d manticoresearch/manticore
or
docker run -e EXTRA=1 --name manticore -v $(pwd)/manticore.conf:/etc/manticoresearch/manticore.conf -v $(pwd)/data:/var/lib/manticore/ -p 127.0.0.1:9306:9306 -p 127.0.0.1:9308:9308 -d manticoresearch/manticore
Make sure to remove 127.0.0.1: if you want the ports to be available for external hosts.
The Manticore Search Docker image doesn't come with the Manticore Columnar Library pre-installed, which is necessary if you require columnar storage and secondary indexes. However, it can easily be enabled during runtime by setting the environment variable EXTRA=1. For example, docker run -e EXTRA=1 ... manticoresearch/manticore. This will download and install the library in the data directory (which is typically mapped as a volume in production environments) and it won't be re-downloaded unless the Manticore Search version is changed.
Using EXTRA=1 also activates Manticore Buddy, which is used for processing certain commands. For more information, refer to the changelog.
If you only need the MCL, you can use the environment variable MCL=1.
In many cases, you may want to use Manticore in conjunction with other images specified in a Docker Compose YAML file. Below is the minimal recommended configuration for Manticore Search in a docker-compose.yml file:
version: '2.2'
services:
  manticore:
    container_name: manticore
    image: manticoresearch/manticore
    environment:
      - EXTRA=1
    restart: always
    ports:
      - 127.0.0.1:9306:9306
      - 127.0.0.1:9308:9308
    ulimits:
      nproc: 65535
      nofile:
        soft: 65535
        hard: 65535
      memlock:
        soft: -1
        hard: -1
    volumes:
      - ./data:/var/lib/manticore
      # - ./manticore.conf:/etc/manticoresearch/manticore.conf # uncomment if you use a custom config
Besides using the exposed ports 9306 and 9308, you can log into the instance by running docker-compose exec manticore mysql.
HTTP protocol is exposed on port 9308. You can map the port locally and connect using curl:
docker run -e EXTRA=1 --name manticore -p 9308:9308 -d manticoresearch/manticore
Create a table:
POST /cli -d 'CREATE TABLE testrt ( title text, content text, gid integer)'
Insert a document:
POST /insert
-d'{"index":"testrt","id":1,"doc":{"title":"Hello","content":"world","gid":1}}'
Perform a simple search:
POST /search -d '{"index":"testrt","query":{"match":{"*":"hello world"}}}'
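These requests translate directly to curl; for instance, assuming the port mapping from the docker run command above, the search request could be sent as:
curl -s "http://127.0.0.1:9308/search" -d '{"index":"testrt","query":{"match":{"*":"hello world"}}}'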
By default, the server is set to send its logging to /dev/stdout, which can be viewed from the host with:
docker logs manticore
The query log can be diverted to Docker log by passing the variable QUERY_LOG_TO_STDOUT=true.
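For example, a sketch combining it with the EXTRA variable used above:
docker run -e EXTRA=1 -e QUERY_LOG_TO_STDOUT=true --name manticore -p 127.0.0.1:9306:9306 -d manticoresearch/manticore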
Here is a simple docker-compose.yml for defining a two node cluster:
version: '2.2'
services:
  manticore-1:
    image: manticoresearch/manticore
    environment:
      - EXTRA=1
    restart: always
    ulimits:
      nproc: 65535
      nofile:
        soft: 65535
        hard: 65535
      memlock:
        soft: -1
        hard: -1
    networks:
      - manticore
  manticore-2:
    image: manticoresearch/manticore
    environment:
      - EXTRA=1
    restart: always
    ulimits:
      nproc: 65535
      nofile:
        soft: 65535
        hard: 65535
      memlock:
        soft: -1
        hard: -1
    networks:
      - manticore
networks:
  manticore:
    driver: bridge
docker-compose up
mysql> CREATE TABLE testrt ( title text, content text, gid integer);
mysql> CREATE CLUSTER posts;
Query OK, 0 rows affected (0.24 sec)
mysql> ALTER CLUSTER posts ADD testrt;
Query OK, 0 rows affected (0.07 sec)
MySQL [(none)]> exit
Bye
On the second instance, join the cluster and insert a document into the table:
mysql> JOIN CLUSTER posts AT 'manticore-1:9312';
mysql> INSERT INTO posts:testrt(title,content,gid) VALUES('hello','world',1);
Query OK, 1 row affected (0.00 sec)
MySQL [(none)]> exit
Bye
The inserted document is now visible on the other instance as well:
MySQL [(none)]> select * from testrt;
+---------------------+------+-------+---------+
| id | gid | title | content |
+---------------------+------+-------+---------+
| 3891565839006040065 | 1 | hello | world |
+---------------------+------+-------+---------+
1 row in set (0.00 sec)
MySQL [(none)]> exit
Bye
It's recommended to overwrite the default ulimits of docker for the Manticore instance:
--ulimit nofile=65536:65536
For best performance, table components can be "mlocked" into memory. When Manticore is run under Docker, the instance requires additional privileges to allow memory locking. The following options must be added when running the instance:
--cap-add=IPC_LOCK --ulimit memlock=-1:-1
If you want to run Manticore with a custom configuration that includes table definitions, you will need to mount the configuration to the instance:
docker run -e EXTRA=1 --name manticore -v $(pwd)/manticore.conf:/etc/manticoresearch/manticore.conf -v $(pwd)/data/:/var/lib/manticore -p 127.0.0.1:9306:9306 -d manticoresearch/manticore
Take into account that Manticore Search inside the container is run under user manticore. Performing operations with table files (like creating or rotating plain tables) should also be done under manticore. Otherwise, the files will be created under root and the search daemon won't have the rights to open them. For example, here is how you can rotate all tables:
docker exec -it manticore gosu manticore indexer --all --rotate
You can also set individual searchd and common configuration settings using Docker environment variables.
The settings must be prefixed with their section name; for example, in the case of mysql_version_string, the variable must be named searchd_mysql_version_string:
docker run -e EXTRA=1 --name manticore -p 127.0.0.1:9306:9306 -e searchd_mysql_version_string='5.5.0' -d manticoresearch/manticore
In the case of the listen directive, new listening interfaces can be added using the Docker variable searchd_listen, in addition to the default ones. Multiple interfaces can be declared, separated by a vertical bar ("|"). To listen only on a specific network address, the $ip alias (retrieved internally from hostname -i) can be used as the address.
For example -e searchd_listen='9316:http|9307:mysql|$ip:5443:mysql_vip' will add an additional SQL interface on port 9307, an SQL VIP listener on port 5443 running only on the instance's IP, and an HTTP listener on port 9316, in addition to the defaults on 9306 and 9308, respectively.
$ docker run -e EXTRA=1 --rm -p 1188:9307 -e searchd_mysql_version_string='5.5.0' -e searchd_listen='9316:http|9307:mysql|$ip:5443:mysql_vip' manticore
[Mon Aug 17 07:31:58.719 2020] [1] using config file '/etc/manticoresearch/manticore.conf' (9130 chars)...
listening on all interfaces for http, port=9316
listening on all interfaces for mysql, port=9307
listening on 172.17.0.17:5443 for VIP mysql
listening on all interfaces for mysql, port=9306
listening on UNIX socket /var/run/mysqld/mysqld.sock
listening on 172.17.0.17:9312 for sphinx
listening on all interfaces for http, port=9308
prereading 0 indexes
prereaded 0 indexes in 0.000 sec
accepting connections
To start Manticore with custom startup flags, specify them as arguments when using docker run. Ensure you do not include the searchd command and include the --nodetach flag. Here's an example:
docker run -e EXTRA=1 --name manticore --rm manticoresearch/manticore:latest --replay-flags=ignore-trx-errors --nodetach
By default, the main Manticore process searchd runs under user manticore inside the container, but the script that runs when the container starts is executed under your default Docker user, which in most cases is root. If that's not what you want, you can use docker ... --user manticore or user: manticore in the Docker Compose YAML to make everything run under manticore. Read below about a possible volume permissions issue you may encounter and how to solve it.
To build plain tables specified in your custom configuration file, you can use the CREATE_PLAIN_TABLES=1 environment variable. It will execute indexer --all before Manticore starts. This is useful if you don't use volumes, and your tables are easy to recreate.
docker run -e CREATE_PLAIN_TABLES=1 --name manticore -v $(pwd)/manticore.conf:/etc/manticoresearch/manticore.conf -p 9306:9306 -p 9308:9308 -d manticoresearch/manticore
In case you are running Manticore Search docker under non-root (using docker ... --user manticore or user: manticore in docker compose yaml), you can face a permissions issue, for example:
FATAL: directory /var/lib/manticore write error: failed to open /var/lib/manticore/tmp: Permission denied
or in case you are using -e EXTRA=1:
mkdir: cannot create directory ‘/var/lib/manticore/.mcl/’: Permission denied
This can happen because the user which is used to run processes inside the container may have no permissions to modify the directory you have mounted to the container. To fix it you can chown or chmod the mounted directory. If you run the container under user manticore you need to do:
chown -R 999:999 data
since user manticore has ID 999 inside the container.
On Windows, if you want Manticore to start at boot, you can install it as a Windows service by following the instructions in the Manticore as Windows Service guide.
Once Manticore is installed as a service, you can start and stop it from the Control Panel or from the command line using the sc.exe command.
sc.exe start Manticore
sc.exe stop Manticore
Alternatively, if you don't install Manticore as a Windows service, you can start it from the command line by running the following command:
.\bin\searchd -c manticore.conf
This command assumes that you have Manticore's binary and the configuration file in the current directory.
If Manticore is installed using HomeBrew, you can run it as a Brew service.
To start Manticore, run the following command:
brew services start manticoresearch
To stop Manticore, run the following command:
brew services stop manticoresearch
Manticore's data types can be split into two categories: full-text fields and attributes.
Full-text fields:
Full-text fields are represented by the data type text. All other data types are called "attributes".
Attributes are non-full-text values associated with each document that can be used to perform non-full-text filtering, sorting and grouping during a search.
It is often desired to process full-text search results based not only on matching document ID and its rank, but also on a number of other per-document values. For example, one might need to sort news search results by date and then relevance, or search through products within a specified price range, or limit a blog search to posts made by selected users, or group results by month. To do this efficiently, Manticore enables not only full-text fields, but also additional attributes to be added to each document. These attributes can be used to filter, sort, or group full-text matches, or to search only by attributes.
The attributes, unlike full-text fields, are not full-text indexed. They are stored in the table, but it is not possible to search them as full-text.
A good example for attributes would be a forum posts table. Assume that only the title and content fields need to be full-text searchable - but that sometimes it is also required to limit search to a certain author or a sub-forum (i.e., search only those rows that have some specific values of author_id or forum_id); or to sort matches by post_date column; or to group matching posts by month of the post_date and calculate per-group match counts.
CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp);
POST /cli -d "CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)"
$index = new \Manticoresearch\Index($client);
$index->setName('forum');
$index->create([
'title'=>['type'=>'text'],
'content'=>['type'=>'text'],
'author_id'=>['type'=>'int'],
'forum_id'=>['type'=>'int'],
'post_date'=>['type'=>'timestamp']
]);
utilsApi.sql('CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)')
res = await utilsApi.sql('CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)');
utilsApi.sql("CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)");
utilsApi.Sql("CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)");
table forum
{
type = rt
path = forum
# when configuring fields via config, they are indexed (and not stored) by default
rt_field = title
rt_field = content
# this option needs to be specified for the field to be stored
stored_fields = title, content
rt_attr_uint = author_id
rt_attr_uint = forum_id
rt_attr_timestamp = post_date
}
This example shows running a full-text query filtered by author_id, forum_id and sorted by post_date.
select * from forum where author_id=123 and forum_id in (1,3,7) order by post_date desc
POST /search
{
"index": "forum",
"query":
{
"match_all": {},
"bool":
{
"must":
[
{ "equals": { "author_id": 123 } },
{ "in": { "forum_id": [1,3,7] } }
]
}
},
"sort": [ { "post_date": "desc" } ]
}
$client->search([
'index' => 'forum',
'query' =>
[
'match_all' => [],
'bool' => [
'must' => [
'equals' => ['author_id' => 123],
'in' => [
'forum_id' => [
1,3,7
]
]
]
]
],
'sort' => [
['post_date' => 'desc']
]
]);
searchApi.search({"index":"forum","query":{"match_all":{},"bool":{"must":[{"equals":{"author_id":123}},{"in":{"forum_id":[1,3,7]}}]}},"sort":[{"post_date":"desc"}]})
res = await searchApi.search({"index":"forum","query":{"match_all":{},"bool":{"must":[{"equals":{"author_id":123}},{"in":{"forum_id":[1,3,7]}}]}},"sort":[{"post_date":"desc"}]});
HashMap<String,Object> filters = new HashMap<String,Object>(){{
put("must", new HashMap<String,Object>(){{
put("equals",new HashMap<String,Integer>(){{
put("author_id",123);
}});
put("in",
new HashMap<String,Object>(){{
put("forum_id",new int[] {1,3,7});
}});
}});
}};
Map<String,Object> query = new HashMap<String,Object>();
query.put("match_all",null);
query.put("bool",filters);
SearchRequest searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
searchRequest.setQuery(query);
searchRequest.setSort(new ArrayList<Object>(){{
add(new HashMap<String,String>(){{ put("post_date","desc");}});
}});
SearchResponse searchResponse = searchApi.search(searchRequest);
object query = new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
var boolFilter = new BoolFilter();
boolFilter.Must = new List<Object> {
new EqualsFilter("author_id", 123),
new InFilter("forum_id", new List<Object> {1,3,7})
};
searchRequest.AttrFilter = boolFilter;
searchRequest.Sort = new List<Object> { new SortOrder("post_date", SortOrder.OrderEnum.Desc) };
var searchResponse = searchApi.Search(searchRequest);
Manticore supports two types of attribute storages:
As can be understood from their names, they store data differently. The traditional row-wise storage:
With the columnar storage:
The columnar storage was designed to handle large data volume that does not fit into RAM, so the recommendations are:
The traditional row-wise storage is the default, so if you want everything to be stored in a row-wise fashion, you don't need to do anything when you create a table.
To enable the columnar storage you need to:
specify engine='columnar' in CREATE TABLE to make all attributes of the table columnar. Then, if you want to keep a specific attribute row-wise, you need to add engine='rowwise' when you declare it. For example:
create table tbl(title text, type int, price float engine='rowwise') engine='columnar'
specify engine='columnar' for a specific attribute in CREATE TABLE to make it columnar. For example:
create table tbl(title text, type int, price float engine='columnar');
or
create table tbl(title text, type int, price float engine='columnar') engine='rowwise';
Below is the list of data types supported by Manticore Search:
The document identifier is a mandatory attribute, and document IDs must be unique 64-bit unsigned integers. Document IDs can be explicitly declared in the schema, but even if they are not, an id column is still created implicitly. Document IDs cannot be updated. Note that when retrieving document IDs, they are treated as signed 64-bit integers, which means they may be negative. Use the UINT64() function to cast them to unsigned 64-bit integers if necessary.
CREATE TABLE tbl(id bigint, content text);
DESC tbl;
+---------+--------+----------------+
| Field | Type | Properties |
+---------+--------+----------------+
| id | bigint | |
| content | text | indexed stored |
+---------+--------+----------------+
2 rows in set (0.00 sec)
CREATE TABLE tbl(content text);
DESC tbl;
+---------+--------+----------------+
| Field | Type | Properties |
+---------+--------+----------------+
| id | bigint | |
| content | text | indexed stored |
+---------+--------+----------------+
2 rows in set (0.00 sec)
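For instance, the UINT64() cast mentioned above can be applied when selecting the id; a minimal sketch using the tbl table from the examples:
select uint64(id) from tbl;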
General syntax:
string|text [stored|attribute] [indexed]
Properties:
indexed - full-text indexed (can be used in full-text queries)
stored - stored in a docstore (stored on disk, not in RAM, lazy read)
attribute - makes it a string attribute (can sort/group by it)
Specifying at least one property overrides all the default ones (see below), i.e., if you decide to use a custom combination of properties, you need to list all the properties you want.
No properties specified:
string and text are aliases, but if you don’t specify any properties, they by default mean different things:
string by default means attribute (see details below).
text by default means stored + indexed (see details below).
The text (just text or text/string indexed) data type forms the full-text part of the table. Text fields are indexed and can be searched for keywords.
Text is passed through an analyzer pipeline that converts the text to words, applies morphology transformations, etc. Eventually, a full-text table (a special data structure that enables quick searches for a keyword) gets built from that text.
Full-text fields can only be used in the MATCH() clause and cannot be used for sorting or aggregation. Words are stored in an inverted index along with references to the fields they belong to and positions in the field. This allows searching a word inside each field and using advanced operators like proximity. By default, the original text of the fields is both indexed and stored in document storage. It means that the original text can be returned with the query results and used in search result highlighting.
CREATE TABLE products(title text);
POST /cli -d "CREATE TABLE products(title text)"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text']
]);
utilsApi.sql('CREATE TABLE products(title text)')
res = await utilsApi.sql('CREATE TABLE products(title text)');
utilsApi.sql("CREATE TABLE products(title text)");
utilsApi.Sql("CREATE TABLE products(title text)");
table products
{
type = rt
path = products
# when configuring fields via config, they are indexed (and not stored) by default
rt_field = title
# this option needs to be specified for the field to be stored
stored_fields = title
}
This behavior can be overridden by explicitly specifying that the text is only indexed.
CREATE TABLE products(title text indexed);
POST /cli -d "CREATE TABLE products(title text indexed)"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text','options'=>['indexed']]
]);
utilsApi.sql('CREATE TABLE products(title text indexed)')
res = await utilsApi.sql('CREATE TABLE products(title text indexed)');
utilsApi.sql("CREATE TABLE products(title text indexed)");
utilsApi.Sql("CREATE TABLE products(title text indexed)");
table products
{
type = rt
path = products
# when configuring fields via config, they are indexed (and not stored) by default
rt_field = title
}
Fields are named, and you can limit your searches to a single field (e.g. search through "title" only) or a subset of fields (e.g. "title" and "abstract" only). You can have up to 256 full-text fields.
select * from products where match('@title first');
POST /search
{
"index": "products",
"query":
{
"match": { "title": "first" }
}
}
$index->setName('products')->search('@title first')->get();
searchApi.search({"index":"products","query":{"match":{"title":"first"}}})
res = await searchApi.search({"index":"products","query":{"match":{"title":"first"}}});
utilsApi.sql("CREATE TABLE products(title text indexed)");
utilsApi.Sql("CREATE TABLE products(title text indexed)");
Unlike full-text fields, string attributes (just string or string/text attribute) are stored as they are received and cannot be used in full-text searches. Instead, they are returned in results, can be used in the WHERE clause for comparison filtering or REGEX, and can be used for sorting and aggregation. In general, it's not recommended to store large texts in string attributes, but use string attributes for metadata like names, titles, tags, keys.
If you want to also index the string attribute, you can specify both as string attribute indexed. It will allow full-text searching and works as an attribute.
CREATE TABLE products(title text, keys string);
POST /cli -d "CREATE TABLE products(title text, keys string)"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'keys'=>['type'=>'string']
]);
utilsApi.sql('CREATE TABLE products(title text, keys string)')
res = await utilsApi.sql('CREATE TABLE products(title text, keys string)');
utilsApi.sql("CREATE TABLE products(title text, keys string)");
utilsApi.Sql("CREATE TABLE products(title text, keys string)");
table products
{
type = rt
path = products
rt_field = title
stored_fields = title
rt_attr_string = keys
}
CREATE TABLE products ( title string attribute indexed );
POST /cli -d "CREATE TABLE products ( title string attribute indexed )"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'string','options'=>['indexed','attribute']]
]);
utilsApi.sql('CREATE TABLE products ( title string attribute indexed )')
res = await utilsApi.sql('CREATE TABLE products ( title string attribute indexed )');
utilsApi.sql("CREATE TABLE products ( title string attribute indexed )");
utilsApi.Sql("CREATE TABLE products ( title string attribute indexed )");
table products
{
type = rt
path = products
rt_field = title
rt_attr_string = title
}
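As an illustration of the filtering and sorting described above for string attributes, here is a sketch assuming the earlier products table with the keys string attribute and a hypothetical value 'sale':
select * from products where keys = 'sale' order by keys asc;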
The integer type allows storing 32-bit unsigned integer values.
CREATE TABLE products(title text, price int);
POST /cli -d "CREATE TABLE products(title text, price int)"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'int']
]);
utilsApi.sql('CREATE TABLE products(title text, price int)')
res = await utilsApi.sql('CREATE TABLE products(title text, price int)');
utilsApi.sql("CREATE TABLE products(title text, price int)");
utilsApi.Sql("CREATE TABLE products(title text, price int)");
table products
{
type = rt
path = products
rt_field = title
stored_fields = title
rt_attr_uint = price
}
Integers can be stored in sizes shorter than 32 bits by specifying a bit count. For example, if we want to store a numeric value which we know is not going to be bigger than 8, the type can be defined as bit(3). Bitcount integers perform more slowly than the full-size ones, but they require less RAM. They are saved in 32-bit chunks, so in order to save space, they should be grouped at the end of attribute definitions (otherwise a bitcount integer between two full-size integers will occupy 32 bits as well).
CREATE TABLE products(title text, flags bit(3), tags bit(2) );
POST /cli -d "CREATE TABLE products(title text, flags bit(3), tags bit(2))"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'flags'=>['type'=>'bit(3)'],
'tags'=>['type'=>'bit(2)']
]);
utilsApi.sql('CREATE TABLE products(title text, flags bit(3), tags bit(2))')
res = await utilsApi.sql('CREATE TABLE products(title text, flags bit(3), tags bit(2))');
utilsApi.sql("CREATE TABLE products(title text, flags bit(3), tags bit(2))");
utilsApi.Sql("CREATE TABLE products(title text, flags bit(3), tags bit(2))");
table products
{
type = rt
path = products
rt_field = title
stored_fields = title
rt_attr_uint = flags:3
rt_attr_uint = tags:2
}
Big integers (bigint) are 64-bit wide signed integers.
CREATE TABLE products(title text, price bigint );
POST /cli -d "CREATE TABLE products(title text, price bigint)"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'bigint']
]);
utilsApi.sql('CREATE TABLE products(title text, price bigint )')
res = await utilsApi.sql('CREATE TABLE products(title text, price bigint )');
utilsApi.sql("CREATE TABLE products(title text, price bigint )");
utilsApi.Sql("CREATE TABLE products(title text, price bigint )");
table products
{
type = rt
path = products
rt_field = title
stored_fields = title
rt_attr_bigint = price
}
Declares a boolean attribute. It's equivalent to an integer attribute with bit count of 1.
CREATE TABLE products(title text, sold bool );
POST /cli -d "CREATE TABLE products(title text, sold bool)"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'sold'=>['type'=>'bool']
]);
utilsApi.sql('CREATE TABLE products(title text, sold bool )')
res = await utilsApi.sql('CREATE TABLE products(title text, sold bool )');
utilsApi.sql("CREATE TABLE products(title text, sold bool )");
utilsApi.Sql("CREATE TABLE products(title text, sold bool )");
table products
{
type = rt
path = products
rt_field = title
stored_fields = title
rt_attr_bool = sold
}
The timestamp type represents unix timestamps, which are stored as 32-bit integers. The difference from a plain integer is that time and date functions are available for the timestamp type.
CREATE TABLE products(title text, date timestamp);
POST /cli -d "CREATE TABLE products(title text, date timestamp)"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'date'=>['type'=>'timestamp']
]);
utilsApi.sql('CREATE TABLE products(title text, date timestamp)')
res = await utilsApi.sql('CREATE TABLE products(title text, date timestamp)');
utilsApi.sql("CREATE TABLE products(title text, date timestamp)");
utilsApi.Sql("CREATE TABLE products(title text, date timestamp)");
table products
{
type = rt
path = products
rt_field = title
stored_fields = title
rt_attr_timestamp = date
}
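For example, date functions can then be applied to the attribute in queries; a minimal sketch assuming YEAR() is one of the available date functions:
select title, year(date) from products;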
Real numbers are stored as 32-bit IEEE 754 single precision floats.
CREATE TABLE products(title text, coeff float);
POST /cli -d "CREATE TABLE products(title text, coeff float)"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'coeff'=>['type'=>'float']
]);
utilsApi.sql('CREATE TABLE products(title text, coeff float)')
res = await utilsApi.sql('CREATE TABLE products(title text, coeff float)');
utilsApi.sql("CREATE TABLE products(title text, coeff float)");
utilsApi.Sql("CREATE TABLE products(title text, coeff float)");
table products
{
type = rt
path = products
rt_field = title
stored_fields = title
rt_attr_float = coeff
}
Unlike integer types, comparing two floating-point numbers for equality is not recommended due to potential rounding errors. A more reliable approach is to use a near-equal comparison, by checking the absolute error margin.
select abs(a-b)<=0.00001 from products
POST /search
{
"index": "products",
"query": { "match_all": {} } },
"expressions": { "eps": "abs(a-b)" }
}
$index->setName('products')->search('')->expression('eps','abs(a-b)')->get();
searchApi.search({"index":"products","query":{"match_all":{}},"expressions":{"eps":"abs(a-b)"}})
res = await searchApi.search({"index":"products","query":{"match_all":{}},"expressions":{"eps":"abs(a-b)"}});
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
Object expressions = new HashMap<String,Object>(){{
put("ebs","abs(a-b)");
}};
searchRequest.setExpressions(expressions);
searchResponse = searchApi.search(searchRequest);
object query = new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Expressions = new List<Object>{
new Dictionary<string, string> { {"ebs", "abs(a-b)"} }
};
var searchResponse = searchApi.Search(searchRequest);
Another alternative, which can also be used to perform IN(attr,val1,val2,val3), is to compare floats as integers by choosing a multiplier factor and converting the floats to integers in operations. The following example illustrates modifying IN(attr,2.0,2.5,3.5) to work with integer values.
select in(ceil(attr*100),200,250,350) from products
POST /search
{
"index": "products",
"query": { "match_all": {} } },
"expressions": { "inc": "in(ceil(attr*100),200,250,350)" }
}
$index->setName('products')->search('')->expression('inc','in(ceil(attr*100),200,250,350)')->get();
searchApi.search({"index":"products","query":{"match_all":{}}},"expressions":{"inc":"in(ceil(attr*100),200,250,350)"}})
res = await searchApi.search({"index":"products","query":{"match_all":{}},"expressions":{"inc":"in(ceil(attr*100),200,250,350)"}});
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
Object expressions = new HashMap<String,Object>(){{
put("inc","in(ceil(attr*100),200,250,350)");
}};
searchRequest.setExpressions(expressions);
searchResponse = searchApi.search(searchRequest);
object query = new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Expressions = new List<Object> {
new Dictionary<string, string> { {"ebs", "in(ceil(attr*100),200,250,350)"} }
};
var searchResponse = searchApi.Search(searchRequest);
This data type allows storing JSON objects, which is useful for storing schema-less data. It is not supported by columnar storage; however, it can be stored in traditional row-wise storage, and it's possible to combine both storage types in the same table.
CREATE TABLE products(title text, data json);
POST /cli -d "CREATE TABLE products(title text, data json)"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'data'=>['type'=>'json']
]);
utilsApi.sql('CREATE TABLE products(title text, data json)')
res = await utilsApi.sql('CREATE TABLE products(title text, data json)');
utilsApi.sql("CREATE TABLE products(title text, data json)");
utilsApi.Sql("CREATE TABLE products(title text, data json)");
table products
{
type = rt
path = products
rt_field = title
stored_fields = title
rt_attr_json = data
}
JSON properties can be used in most operations. There are also special functions such as ALL(), ANY(), GREATEST(), LEAST() and INDEXOF() that allow traversal of property arrays.
select indexof(x>2 for x in data.intarray) from products
POST /search
{
"index": "products",
"query": { "match_all": {} } },
"expressions": { "idx": "indexof(x>2 for x in data.intarray)" }
}
$index->setName('products')->search('')->expression('idx','indexof(x>2 for x in data.intarray)')->get();
searchApi.search({"index":"products","query":{"match_all":{}}},"expressions":{"idx":"indexof(x>2 for x in data.intarray)"}})
res = await searchApi.search({"index":"products","query":{"match_all":{}},"expressions":{"idx":"indexof(x>2 for x in data.intarray)"}});
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
Object expressions = new HashMap<String,Object>(){{
put("idx","indexof(x>2 for x in data.intarray)");
}};
searchRequest.setExpressions(expressions);
searchResponse = searchApi.search(searchRequest);
object query = new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Expressions = new List<Object> {
new Dictionary<string, string> { {"idx", "indexof(x>2 for x in data.intarray)"} }
};
var searchResponse = searchApi.Search(searchRequest);
Text properties are treated the same as strings, so it's not possible to use them in full-text match expressions. However, string functions such as REGEX() can be used.
select regex(data.name, 'est') as c from products where c>0
POST /search
{
"index": "products",
"query":
{
"match_all": {},
"range": { "c": { "gt": 0 } } }
},
"expressions": { "c": "regex(data.name, 'est')" }
}
$index->setName('products')->search('')->expression('c',"regex(data.name, 'est')")->filter('c','gt',0)->get();
searchApi.search({"index":"products","query":{"match_all":{},"range":{"c":{"gt":0}}}},"expressions":{"c":"regex(data.name, 'est')"}})
res = await searchApi.search({"index":"products","query":{"match_all":{},"range":{"c":{"gt":0}}},"expressions":{"c":"regex(data.name, 'est')"}});
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
query.put("range", new HashMap<String,Object>(){{
put("c", new HashMap<String,Object>(){{
put("gt",0);
}});
}});
searchRequest.setQuery(query);
Object expressions = new HashMap<String,Object>(){{
put("idx","indexof(x>2 for x in data.intarray)");
}};
searchRequest.setExpressions(expressions);
searchResponse = searchApi.search(searchRequest);
object query = new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
var rangeFilter = new RangeFilter("c");
rangeFilter.Gt = 0;
searchRequest.AttrFilter = rangeFilter;
searchRequest.Expressions = new List<Object> {
new Dictionary<string, string> { {"idx", "indexof(x>2 for x in data.intarray)"} }
};
var searchResponse = searchApi.Search(searchRequest);
In the case of JSON properties, enforcing data type may be required for proper functionality in certain situations. For example, when working with float values, DOUBLE() must be used for proper sorting.
select * from products order by double(data.myfloat) desc
POST /search
{
"index": "products",
"query": { "match_all": {} } },
"sort": [ { "double(data.myfloat)": { "order": "desc"} } ]
}
$index->setName('products')->search('')->sort('double(data.myfloat)','desc')->get();
searchApi.search({"index":"products","query":{"match_all":{}}},"sort":[{"double(data.myfloat)":{"order":"desc"}}]})
res = await searchApi.search({"index":"products","query":{"match_all":{}}},"sort":[{"double(data.myfloat)":{"order":"desc"}}]});
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
searchRequest.setSort(new ArrayList<Object>(){{
add(new HashMap<String,String>(){{ put("double(data.myfloat)",new HashMap<String,String>(){{ put("order","desc");}});}});
}});
searchResponse = searchApi.search(searchRequest);
object query = new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Sort = new List<Object> {
new SortOrder("double(data.myfloat)", SortOrder.OrderEnum.Desc)
};
var searchResponse = searchApi.Search(searchRequest);
Float vector attributes allow storing variable-length lists of floats. It's important to note that this concept differs from multi-valued attributes. Multi-valued attributes (MVAs) are essentially sets; they do not preserve value order, and duplicate values are not retained. In contrast, float vectors perform no additional processing on values during insertion.
Float vector attributes can be used in k-nearest neighbor searches; see KNN search.
** Currently, float_vector fields can only be utilized in KNN search within real-time tables and the data type is not supported in any other functions or expressions, nor is it supported in plain tables. **
CREATE TABLE products(title text, image_vector float_vector);
POST /cli -d "CREATE TABLE products(title text, image_vector float_vector)"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'image_vector'=>['type'=>'float_vector']
]);
utilsApi.sql('CREATE TABLE products(title text, image_vector float_vector)')
res = await utilsApi.sql('CREATE TABLE products(title text, image_vector float_vector)');
utilsApi.sql("CREATE TABLE products(title text, image_vector float_vector)");
utilsApi.Sql("CREATE TABLE products(title text, image_vector float_vector)");
table products
{
type = rt
path = products
rt_field = title
stored_fields = title
rt_attr_float_vector = image_vector
}
Multi-value attributes allow storing variable-length lists of 32-bit unsigned integers. This can be useful for storing one-to-many numeric values, such as tags, product categories, and properties.
CREATE TABLE products(title text, product_codes multi);
POST /cli -d "CREATE TABLE products(title text, product_codes multi)"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'product_codes'=>['type'=>'multi']
]);
utilsApi.sql('CREATE TABLE products(title text, product_codes multi)')
res = await utilsApi.sql('CREATE TABLE products(title text, product_codes multi)');
utilsApi.sql("CREATE TABLE products(title text, product_codes multi)");
utilsApi.Sql("CREATE TABLE products(title text, product_codes multi)");
table products
{
type = rt
path = products
rt_field = title
stored_fields = title
rt_attr_multi = product_codes
}
It supports filtering and aggregation, but not sorting. Filtering can be done using a condition that requires at least one element to pass (using ANY()) or all elements (ALL()) to pass.
select * from products where any(product_codes)=3
POST /search
{
"index": "products",
"query":
{
"match_all": {},
"equals" : { "any(product_codes)": 3 }
}
}
$index->setName('products')->search('')->filter('any(product_codes)','equals',3)->get();
searchApi.search({"index":"products","query":{"match_all":{},"equals":{"any(product_codes)":3}}}})
res = await searchApi.search({"index":"products","query":{"match_all":{},"equals":{"any(product_codes)":3}}}})'
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
query.put("equals",new HashMap<String,Integer>(){{
put("any(product_codes)",3);
}});
searchRequest.setQuery(query);
searchResponse = searchApi.search(searchRequest);
object query = new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.AttrFilter = new EqualsFilter("any(product_codes)", 3);
var searchResponse = searchApi.Search(searchRequest);
Information like least or greatest element and length of the list can be extracted. An example shows ordering by the least element of a multi-value attribute.
select least(product_codes) l from products order by l asc
POST /search
{
"index": "products",
"query":
{
"match_all": {},
"sort": [ { "product_codes":{ "order":"asc", "mode":"min" } } ]
}
}
$index->setName('products')->search('')->sort('product_codes','asc','min')->get();
searchApi.search({"index":"products","query":{"match_all":{},"sort":[{"product_codes":{"order":"asc","mode":"min"}}]}})
res = await searchApi.search({"index":"products","query":{"match_all":{},"sort":[{"product_codes":{"order":"asc","mode":"min"}}]}});
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
searchRequest.setSort(new ArrayList<Object>(){{
add(new HashMap<String,String>(){{ put("product_codes",new HashMap<String,String>(){{ put("order","asc");put("mode","min");}});}});
}});
searchResponse = searchApi.search(searchRequest);
object query = new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Sort = new List<Object> {
new SortMVA("product_codes", SortOrder.OrderEnum.Asc, SortMVA.ModeEnum.Min)
};
searchResponse = searchApi.search(searchRequest);
When grouping by a multi-value attribute, a document will contribute to as many groups as there are different values associated with that document. For instance, if a collection contains exactly one document having a 'product_codes' multi-value attribute with values 5, 7, and 11, grouping on 'product_codes' will produce 3 groups with COUNT(*)equal to 1 and GROUPBY() key values of 5, 7, and 11, respectively. Also, note that grouping by multi-value attributes may lead to duplicate documents in the result set because each document can participate in many groups.
insert into products values ( 1, 'doc one', (5,7,11) );
select id, count(*), groupby() from products group by product_codes;
Query OK, 1 row affected (0.00 sec)
+------+----------+-----------+
| id | count(*) | groupby() |
+------+----------+-----------+
| 1 | 1 | 11 |
| 1 | 1 | 7 |
| 1 | 1 | 5 |
+------+----------+-----------+
3 rows in set (0.00 sec)
The order of the numbers inserted as values of multivalued attributes is not preserved. Values are stored internally as a sorted set.
insert into product values (1,'first',(4,2,1,3));
select * from products;
Query OK, 1 row affected (0.00 sec)
+------+---------------+-------+
| id | product_codes | title |
+------+---------------+-------+
| 1 | 1,2,3,4 | first |
+------+---------------+-------+
1 row in set (0.01 sec)
POST /insert
{
"index":"products",
"id":1,
"doc":
{
"title":"first",
"product_codes":[4,2,1,3]
}
}
POST /search
{
"index": "products",
"query": { "match_all": {} }
}
{
"_index":"products",
"_id":1,
"created":true,
"result":"created",
"status":201
}
{
"took":0,
"timed_out":false,
"hits":{
"total":1,
"hits":[
{
"_id":"1",
"_score":1,
"_source":{
"product_codes":[
1,
2,
3,
4
],
"title":"first"
}
}
]
}
}
$index->addDocument([
"title"=>"first",
"product_codes"=>[4,2,1,3]
]);
$index->search('')-get();
Array
(
[_index] => products
[_id] => 1
[created] => 1
[result] => created
[status] => 201
)
Array
(
[took] => 0
[timed_out] =>
[hits] => Array
(
[total] => 1
[hits] => Array
(
[0] => Array
(
[_id] => 1
[_score] => 1
[_source] => Array
(
[product_codes] => Array
(
[0] => 1
[1] => 2
[2] => 3
[3] => 4
)
[title] => first
)
)
)
)
)
indexApi.insert({"index":"products","id":1,"doc":{"title":"first","product_codes":[4,2,1,3]}})
searchApi.search({"index":"products","query":{"match_all":{}}})
{'created': True,
'found': None,
'id': 1,
'index': 'products',
'result': 'created'}
{'hits': {'hits': [{u'_id': u'1',
u'_score': 1,
u'_source': {u'product_codes': [1, 2, 3, 4],
u'title': u'first'}}],
'total': 1},
'profile': None,
'timed_out': False,
'took': 29}
await indexApi.insert({"index":"products","id":1,"doc":{"title":"first","product_codes":[4,2,1,3]}});
res = await searchApi.search({"index":"products","query":{"match_all":{}}});
{"took":0,"timed_out":false,"hits":{"total":1,"hits":[{"_id":"1","_score":1,"_source":{"product_codes":[1,2,3,4],"title":"first"}}]}}
InsertDocumentRequest newdoc = new InsertDocumentRequest();
HashMap<String,Object> doc = new HashMap<String,Object>(){{
put("title","first");
put("product_codes",new int[] {4,2,1,3});
}};
newdoc.index("products").id(1L).setDoc(doc);
sqlresult = indexApi.insert(newdoc);
Map<String,Object> query = new HashMap<String,Object>();
query.put("match_all",null);
SearchRequest searchRequest = new SearchRequest();
searchRequest.setIndex("products");
searchRequest.setQuery(query);
SearchResponse searchResponse = searchApi.search(searchRequest);
System.out.println(searchResponse.toString() );
class SearchResponse {
took: 0
timedOut: false
hits: class SearchResponseHits {
total: 1
hits: [{_id=1, _score=1, _source={product_codes=[1, 2, 3, 4], title=first}}]
aggregations: null
}
profile: null
}
Dictionary<string, Object> doc = new Dictionary<string, Object>();
doc.Add("title", "first");
doc.Add("product_codes", new List<Object> {4,2,1,3});
InsertDocumentRequest newdoc = new InsertDocumentRequest(index: "products", id: 1, doc: doc);
var sqlresult = indexApi.Insert(newdoc);
object query = new { match_all=null };
var searchRequest = new SearchRequest("products", query);
var searchResponse = searchApi.Search(searchRequest);
Console.WriteLine(searchResponse.ToString())
class SearchResponse {
took: 0
timedOut: false
hits: class SearchResponseHits {
total: 1
hits: [{_id=1, _score=1, _source={product_codes=[1, 2, 3, 4], title=first}}]
aggregations: null
}
profile: null
}
A data type that allows storing variable-length lists of 64-bit signed integers. It has the same functionality as multi-value integer.
CREATE TABLE products(title text, values multi64);
POST /cli -d "CREATE TABLE products(title text, values multi64)"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'values'=>['type'=>'multi64']
]);
utilsApi.sql('CREATE TABLE products(title text, values multi64))')
res = await utilsApi.sql('CREATE TABLE products(title text, values multi64))');
utilsApi.sql("CREATE TABLE products(title text, values multi64))");
utilsApi.Sql("CREATE TABLE products(title text, values multi64))");
table products
{
type = rt
path = products
rt_field = title
stored_fields = title
rt_attr_multi_64 = values
}
When you use the columnar storage you can specify the following properties for the attributes.
By default, Manticore Columnar storage stores all attributes in a columnar fashion, as well as in a special docstore row by row. This enables fast execution of queries like SELECT * FROM ..., especially when fetching a large number of records at once. However, if you are sure that you do not need it or wish to save disk space, you can disable it by specifying fast_fetch='0' when creating a table or (if you are defining a table in a config) by using columnar_no_fast_fetch as shown in the following example.
create table t(a int, b int fast_fetch='0') engine='columnar'; desc t;
+-------+--------+---------------------+
| Field | Type | Properties |
+-------+--------+---------------------+
| id | bigint | columnar fast_fetch |
| a | uint | columnar fast_fetch |
| b | uint | columnar |
+-------+--------+---------------------+
3 rows in set (0.00 sec)
source min {
type = mysql
sql_host = localhost
sql_user = test
sql_pass =
sql_db = test
sql_query = select 1, 1 a, 1 b
sql_attr_uint = a
sql_attr_uint = b
}
table tbl {
path = tbl/col
source = min
columnar_attrs = *
columnar_no_fast_fetch = b
}
+-------+--------+---------------------+
| Field | Type | Properties |
+-------+--------+---------------------+
| id | bigint | columnar fast_fetch |
| a | uint | columnar fast_fetch |
| b | uint | columnar |
+-------+--------+---------------------+
Manticore's data types can be split into two categories: full-text fields and attributes.
Full-text fields are represented by the data type text. All other data types are called "attributes".
Attributes are non-full-text values associated with each document that can be used to perform non-full-text filtering, sorting and grouping during a search.
It is often desired to process full-text search results based not only on matching document ID and its rank, but also on a number of other per-document values. For example, one might need to sort news search results by date and then relevance, or search through products within a specified price range, or limit a blog search to posts made by selected users, or group results by month. To do this efficiently, Manticore enables not only full-text fields, but also additional attributes to be added to each document. These attributes can be used to filter, sort, or group full-text matches, or to search only by attributes.
The attributes, unlike full-text fields, are not full-text indexed. They are stored in the table, but it is not possible to search them as full-text.
A good example for attributes would be a forum posts table. Assume that only the title and content fields need to be full-text searchable - but that sometimes it is also required to limit search to a certain author or a sub-forum (i.e., search only those rows that have some specific values of author_id or forum_id); or to sort matches by post_date column; or to group matching posts by month of the post_date and calculate per-group match counts.
CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp);
POST /cli -d "CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)"
$index = new \Manticoresearch\Index($client);
$index->setName('forum');
$index->create([
'title'=>['type'=>'text'],
'content'=>['type'=>'text'],
'author_id'=>['type'=>'int'],
'forum_id'=>['type'=>'int'],
'post_date'=>['type'=>'timestamp']
]);
utilsApi.sql('CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)')
res = await utilsApi.sql('CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)');
utilsApi.sql("CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)");
utilsApi.Sql("CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)");
table forum
{
type = rt
path = forum
# when configuring fields via config, they are indexed (and not stored) by default
rt_field = title
rt_field = content
# this option needs to be specified for the field to be stored
stored_fields = title, content
rt_attr_uint = author_id
rt_attr_uint = forum_id
rt_attr_timestamp = post_date
}
This example shows running a full-text query filtered by author_id, forum_id and sorted by post_date.
select * from forum where author_id=123 and forum_id in (1,3,7) order by post_date desc
POST /search
{
"index": "forum",
"query":
{
"match_all": {},
"bool":
{
"must":
[
{ "equals": { "author_id": 123 } },
{ "in": { "forum_id": [1,3,7] } }
]
}
},
"sort": [ { "post_date": "desc" } ]
}
$client->search([
'index' => 'forum',
'query' =>
[
'match_all' => [],
'bool' => [
'must' => [
'equals' => ['author_id' => 123],
'in' => [
'forum_id' => [
1,3,7
]
]
]
]
],
'sort' => [
['post_date' => 'desc']
]
]);
searchApi.search({"index":"forum","query":{"match_all":{},"bool":{"must":[{"equals":{"author_id":123}},{"in":{"forum_id":[1,3,7]}}]}},"sort":[{"post_date":"desc"}]})
res = await searchApi.search({"index":"forum","query":{"match_all":{},"bool":{"must":[{"equals":{"author_id":123}},{"in":{"forum_id":[1,3,7]}}]}},"sort":[{"post_date":"desc"}]});
HashMap<String,Object> filters = new HashMap<String,Object>(){{
put("must", new HashMap<String,Object>(){{
put("equals",new HashMap<String,Integer>(){{
put("author_id",123);
}});
put("in",
new HashMap<String,Object>(){{
put("forum_id",new int[] {1,3,7});
}});
}});
}};
Map<String,Object> query = new HashMap<String,Object>();
query.put("match_all",null);
query.put("bool",filters);
SearchRequest searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
searchRequest.setQuery(query);
searchRequest.setSort(new ArrayList<Object>(){{
add(new HashMap<String,String>(){{ put("post_date","desc");}});
}});
SearchResponse searchResponse = searchApi.search(searchRequest);
object query = new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
var boolFilter = new BoolFilter();
boolFilter.Must = new List<Object> {
new EqualsFilter("author_id", 123),
new InFilter("forum_id", new List<Object> {1,3,7})
};
searchRequest.AttrFilter = boolFilter;
searchRequest.Sort = new List<Object> { new SortOrder("post_date", SortOrder.OrderEnum.Desc) };
var searchResponse = searchApi.Search(searchRequest);
Manticore supports two types of attribute storage: row-wise and columnar.
As their names suggest, the two storages lay out data differently: the traditional row-wise storage keeps all attributes of a document together, while the columnar storage keeps each attribute in its own column.
The columnar storage was designed to handle data volumes that do not fit into RAM, so the general recommendation is to stay with the default row-wise storage when your attributes fit in memory and to switch to the columnar storage when they do not.
The traditional row-wise storage is the default, so if you want everything to be stored in a row-wise fashion, you don't need to do anything when you create a table.
To enable the columnar storage you can either:
- specify engine='columnar' in CREATE TABLE to make all attributes of the table columnar. Then, if you want to keep a specific attribute row-wise, add engine='rowwise' when you declare it. For example:
create table tbl(title text, type int, price float engine='rowwise') engine='columnar'
- or specify engine='columnar' for a specific attribute in CREATE TABLE to make only that attribute columnar. For example:
create table tbl(title text, type int, price float engine='columnar');
or
create table tbl(title text, type int, price float engine='columnar') engine='rowwise';
The data types supported by Manticore Search are described below.
The document identifier is a mandatory attribute, and document IDs must be unique 64-bit unsigned integers. Document IDs can be explicitly declared when creating a table, but even if they are not, the id attribute is still created implicitly. Document IDs cannot be updated. Note that when document IDs are retrieved, they are treated as signed 64-bit integers, which means they may appear negative. Use the UINT64() function to cast them to unsigned 64-bit integers if necessary.
CREATE TABLE tbl(id bigint, content text);
DESC tbl;
+---------+--------+----------------+
| Field | Type | Properties |
+---------+--------+----------------+
| id | bigint | |
| content | text | indexed stored |
+---------+--------+----------------+
2 rows in set (0.00 sec)
CREATE TABLE tbl(content text);
DESC tbl;
+---------+--------+----------------+
| Field | Type | Properties |
+---------+--------+----------------+
| id | bigint | |
| content | text | indexed stored |
+---------+--------+----------------+
2 rows in set (0.00 sec)
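As a minimal sketch of the UINT64() cast mentioned above (assuming the table already contains documents):
-- view document IDs as unsigned 64-bit integers
select uint64(id) from tbl;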
General syntax:
string|text [stored|attribute] [indexed]
Properties:
- indexed - full-text indexed (can be used in full-text queries)
- stored - stored in a docstore (stored on disk, not in RAM, lazy read)
- attribute - makes it a string attribute (can sort/group by it)

Specifying at least one property overrides all the default ones (see below), i.e., if you decide to use a custom combination of properties, you need to list all the properties you want.
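For instance, a field that should be both full-text searchable and kept in the docstore, but not usable as an attribute, could be declared by listing both properties explicitly. A minimal sketch with a hypothetical table name:
-- 'body' is full-text indexed and its original text is stored, but it is not a string attribute
CREATE TABLE docs(body text stored indexed);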
No properties specified:
string and text are aliases, but if you don’t specify any properties, they by default mean different things:
- string by default means attribute (see details below).
- text by default means stored + indexed (see details below).

The text (just text or text/string indexed) data type forms the full-text part of the table. Text fields are indexed and can be searched for keywords.
Text is passed through an analyzer pipeline that converts the text to words, applies morphology transformations, etc. Eventually, a full-text table (a special data structure that enables quick searches for a keyword) gets built from that text.
Full-text fields can only be used in the MATCH() clause and cannot be used for sorting or aggregation. Words are stored in an inverted index along with references to the fields they belong to and positions in the field. This allows searching a word inside each field and using advanced operators like proximity. By default, the original text of the fields is both indexed and stored in document storage. It means that the original text can be returned with the query results and used in search result highlighting.
CREATE TABLE products(title text);
POST /cli -d "CREATE TABLE products(title text)"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text']
]);
utilsApi.sql('CREATE TABLE products(title text)')
res = await utilsApi.sql('CREATE TABLE products(title text)');
utilsApi.sql("CREATE TABLE products(title text)");
utilsApi.Sql("CREATE TABLE products(title text)");
table products
{
type = rt
path = products
# when configuring fields via config, they are indexed (and not stored) by default
rt_field = title
# this option needs to be specified for the field to be stored
stored_fields = title
}
This behavior can be overridden by explicitly specifying that the text is only indexed.
CREATE TABLE products(title text indexed);
POST /cli -d "CREATE TABLE products(title text indexed)"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text','options'=>['indexed']]
]);
utilsApi.sql('CREATE TABLE products(title text indexed)')
res = await utilsApi.sql('CREATE TABLE products(title text indexed)');
utilsApi.sql("CREATE TABLE products(title text indexed)");
utilsApi.Sql("CREATE TABLE products(title text indexed)");
table products
{
type = rt
path = products
# when configuring fields via config, they are indexed (and not stored) by default
rt_field = title
}
Fields are named, and you can limit your searches to a single field (e.g. search through "title" only) or a subset of fields (e.g. "title" and "abstract" only). You can have up to 256 full-text fields.
select * from products where match('@title first');
POST /search
{
"index": "products",
"query":
{
"match": { "title": "first" }
}
}
$index->setName('products')->search('@title first')->get();
searchApi.search({"index":"products","query":{"match":{"title":"first"}}})
res = await searchApi.search({"index":"products","query":{"match":{"title":"first"}}});
utilsApi.sql("CREATE TABLE products(title text indexed)");
utilsApi.Sql("CREATE TABLE products(title text indexed)");
Unlike full-text fields, string attributes (just string or string/text attribute) are stored as they are received and cannot be used in full-text searches. Instead, they are returned in results, can be used in the WHERE clause for comparison filtering or REGEX, and can be used for sorting and aggregation. In general, it's not recommended to store large texts in string attributes, but use string attributes for metadata like names, titles, tags, keys.
If you want the string attribute to also be full-text indexed, you can declare it as string attribute indexed. This allows full-text searching on it while it still works as an attribute.
CREATE TABLE products(title text, keys string);
POST /cli -d "CREATE TABLE products(title text, keys string)"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'keys'=>['type'=>'string']
]);
utilsApi.sql('CREATE TABLE products(title text, keys string)')
res = await utilsApi.sql('CREATE TABLE products(title text, keys string)');
utilsApi.sql("CREATE TABLE products(title text, keys string)");
utilsApi.Sql("CREATE TABLE products(title text, keys string)");
table products
{
type = rt
path = products
rt_field = title
stored_fields = title
rt_attr_string = keys
}
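For illustration, a hypothetical query that filters and sorts by the keys string attribute declared above:
-- string attributes can be compared, sorted and grouped by, but not full-text searched
select * from products where keys = 'hdmi cable' order by keys asc;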
CREATE TABLE products ( title string attribute indexed );
POST /cli -d "CREATE TABLE products ( title string attribute indexed )"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'string','options'=>['indexed','attribute']]
]);
utilsApi.sql('CREATE TABLE products ( title string attribute indexed )')
res = await utilsApi.sql('CREATE TABLE products ( title string attribute indexed )');
utilsApi.sql("CREATE TABLE products ( title string attribute indexed )");
utilsApi.Sql("CREATE TABLE products ( title string attribute indexed )");
table products
{
type = rt
path = products
rt_field = title
rt_attr_string = title
}
The integer type allows storing 32-bit unsigned integer values.
CREATE TABLE products(title text, price int);
POST /cli -d "CREATE TABLE products(title text, price int)"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'int']
]);
utilsApi.sql('CREATE TABLE products(title text, price int)')
res = await utilsApi.sql('CREATE TABLE products(title text, price int)');
utilsApi.sql("CREATE TABLE products(title text, price int)");
utilsApi.Sql("CREATE TABLE products(title text, price int)");
table products
{
type = rt
path = products
rt_field = title
stored_fields = title
rt_attr_uint = price
}
Integers can be stored in smaller sizes than 32 bits by specifying a bit count. For example, if we want to store a numeric value which we know will never exceed 7, the type can be defined as bit(3). Bitcount integers perform more slowly than the full-size ones, but they require less RAM. They are saved in 32-bit chunks, so in order to save space, they should be grouped at the end of attribute definitions (otherwise a bitcount integer between two full-size integers will occupy 32 bits as well).
CREATE TABLE products(title text, flags bit(3), tags bit(2) );
POST /cli -d "CREATE TABLE products(title text, flags bit(3), tags bit(2))"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'flags'=>['type'=>'bit(3)'],
'tags'=>['type'=>'bit(2)']
]);
utilsApi.sql('CREATE TABLE products(title text, flags bit(3), tags bit(2))')
res = await utilsApi.sql('CREATE TABLE products(title text, flags bit(3), tags bit(2))');
utilsApi.sql("CREATE TABLE products(title text, flags bit(3), tags bit(2))");
utilsApi.Sql("CREATE TABLE products(title text, flags bit(3), tags bit(2))");
table products
{
type = rt
path = products
rt_field = title
stored_fields = title
rt_attr_uint = flags:3
rt_attr_uint = tags:2
}
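To illustrate the grouping advice above, here is a sketch with hypothetical table and column names: in the first table the two bitcount attributes can share a single 32-bit chunk, while in the second a bitcount attribute squeezed between two full-size integers still occupies 32 bits.
-- bitcount attributes grouped at the end can share one 32-bit chunk
CREATE TABLE good_layout(title text, price int, qty int, flags bit(3), tags bit(2));
-- a bitcount attribute between two full-size integers wastes space
CREATE TABLE bad_layout(title text, price int, flags bit(3), qty int, tags bit(2));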
Big integers (bigint) are 64-bit wide signed integers.
CREATE TABLE products(title text, price bigint );
POST /cli -d "CREATE TABLE products(title text, price bigint)"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'bigint']
]);
utilsApi.sql('CREATE TABLE products(title text, price bigint )')
res = await utilsApi.sql('CREATE TABLE products(title text, price bigint )');
utilsApi.sql("CREATE TABLE products(title text, price bigint )");
utilsApi.Sql("CREATE TABLE products(title text, price bigint )");
table products
{
type = rt
path = products
rt_field = title
stored_fields = title
rt_attr_bigint = price
}
The bool type declares a boolean attribute. It's equivalent to an integer attribute with a bit count of 1.
CREATE TABLE products(title text, sold bool );
POST /cli -d "CREATE TABLE products(title text, sold bool)"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'sold'=>['type'=>'bool']
]);
utilsApi.sql('CREATE TABLE products(title text, sold bool )')
res = await utilsApi.sql('CREATE TABLE products(title text, sold bool )');
utilsApi.sql("CREATE TABLE products(title text, sold bool )");
utilsApi.Sql("CREATE TABLE products(title text, sold bool )");
table products
{
type = rt
path = products
rt_field = title
stored_fields = title
rt_attr_bool = sold
}
The timestamp type represents Unix timestamps, which are stored as 32-bit integers. The difference from a plain integer is that time and date functions are available for the timestamp type.
CREATE TABLE products(title text, date timestamp);
POST /cli -d "CREATE TABLE products(title text, date timestamp)"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'date'=>['type'=>'timestamp']
]);
utilsApi.sql('CREATE TABLE products(title text, date timestamp)')
res = await utilsApi.sql('CREATE TABLE products(title text, date timestamp)');
utilsApi.sql("CREATE TABLE products(title text, date timestamp)");
utilsApi.Sql("CREATE TABLE products(title text, date timestamp)");
table products
{
type = rt
path = products
rt_field = title
stored_fields = title
rt_attr_timestamp = date
}
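As a minimal illustration of the date functions mentioned above (a hypothetical query against the table declared here; YEAR() is one of the available date functions):
-- count documents per year of the 'date' timestamp attribute
select year(date) as y, count(*) from products group by y;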
Real numbers are stored as 32-bit IEEE 754 single precision floats.
CREATE TABLE products(title text, coeff float);
POST /cli -d "CREATE TABLE products(title text, coeff float)"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'coeff'=>['type'=>'float']
]);
utilsApi.sql('CREATE TABLE products(title text, coeff float)')
res = await utilsApi.sql('CREATE TABLE products(title text, coeff float)');
utilsApi.sql("CREATE TABLE products(title text, coeff float)");
utilsApi.Sql("CREATE TABLE products(title text, coeff float)");
table products
{
type = rt
path = products
rt_field = title
stored_fields = title
rt_attr_float = coeff
}
Unlike integer types, comparing two floating-point numbers for equality is not recommended due to potential rounding errors. A more reliable approach is to use a near-equal comparison, by checking the absolute error margin.
select abs(a-b)<=0.00001 from products
POST /search
{
"index": "products",
"query": { "match_all": {} } },
"expressions": { "eps": "abs(a-b)" }
}
$index->setName('products')->search('')->expression('eps','abs(a-b)')->get();
searchApi.search({"index":"products","query":{"match_all":{}},"expressions":{"eps":"abs(a-b)"}})
res = await searchApi.search({"index":"products","query":{"match_all":{}},"expressions":{"eps":"abs(a-b)"}});
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
Object expressions = new HashMap<String,Object>(){{
put("ebs","abs(a-b)");
}};
searchRequest.setExpressions(expressions);
searchResponse = searchApi.search(searchRequest);
object query = new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Expressions = new List<Object>{
new Dictionary<string, string> { {"ebs", "abs(a-b)"} }
};
var searchResponse = searchApi.Search(searchRequest);
Another alternative, which can also be used to perform IN(attr,val1,val2,val3), is to compare floats as integers by choosing a multiplier factor and converting the floats to integers in operations. The following example illustrates modifying IN(attr,2.0,2.5,3.5) to work with integer values.
select in(ceil(attr*100),200,250,350) from products
POST /search
{
"index": "products",
"query": { "match_all": {} } },
"expressions": { "inc": "in(ceil(attr*100),200,250,350)" }
}
$index->setName('products')->search('')->expression('inc','in(ceil(attr*100),200,250,350)')->get();
searchApi.search({"index":"products","query":{"match_all":{}}},"expressions":{"inc":"in(ceil(attr*100),200,250,350)"}})
res = await searchApi.search({"index":"products","query":{"match_all":{}}},"expressions":{"inc":"in(ceil(attr*100),200,250,350)"}});
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
Object expressions = new HashMap<String,Object>(){{
put("inc","in(ceil(attr*100),200,250,350)");
}};
searchRequest.setExpressions(expressions);
searchResponse = searchApi.search(searchRequest);
object query = new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Expressions = new List<Object> {
new Dictionary<string, string> { {"ebs", "in(ceil(attr*100),200,250,350)"} }
};
var searchResponse = searchApi.Search(searchRequest);
The json data type allows storing JSON objects, which is useful for schema-less data. It is not supported by the columnar storage; however, it can be kept in the traditional row-wise storage, as it's possible to combine both storage types in the same table.
CREATE TABLE products(title text, data json);
POST /cli -d "CREATE TABLE products(title text, data json)"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'data'=>['type'=>'json']
]);
utilsApi.sql('CREATE TABLE products(title text, data json)')
res = await utilsApi.sql('CREATE TABLE products(title text, data json)');
utilsApi.sql("CREATE TABLE products(title text, data json)");
utilsApi.Sql("CREATE TABLE products(title text, data json)");
table products
{
type = rt
path = products
rt_field = title
stored_fields = title
rt_attr_json = data
}
JSON properties can be used in most operations. There are also special functions such as ALL(), ANY(), GREATEST(), LEAST() and INDEXOF() that allow traversal of property arrays.
select indexof(x>2 for x in data.intarray) from products
POST /search
{
"index": "products",
"query": { "match_all": {} } },
"expressions": { "idx": "indexof(x>2 for x in data.intarray)" }
}
$index->setName('products')->search('')->expression('idx','indexof(x>2 for x in data.intarray)')->get();
searchApi.search({"index":"products","query":{"match_all":{}}},"expressions":{"idx":"indexof(x>2 for x in data.intarray)"}})
res = await searchApi.search({"index":"products","query":{"match_all":{}}},"expressions":{"idx":"indexof(x>2 for x in data.intarray)"}});
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
Object expressions = new HashMap<String,Object>(){{
put("idx","indexof(x>2 for x in data.intarray)");
}};
searchRequest.setExpressions(expressions);
searchResponse = searchApi.search(searchRequest);
object query = new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Expressions = new List<Object> {
new Dictionary<string, string> { {"idx", "indexof(x>2 for x in data.intarray)"} }
};
var searchResponse = searchApi.Search(searchRequest);
Text properties are treated the same as strings, so it's not possible to use them in full-text match expressions. However, string functions such as REGEX() can be used.
select regex(data.name, 'est') as c from products where c>0
POST /search
{
"index": "products",
"query":
{
"match_all": {},
"range": { "c": { "gt": 0 } } }
},
"expressions": { "c": "regex(data.name, 'est')" }
}
$index->setName('products')->search('')->expression('c',"regex(data.name, 'est')")->filter('c','gt',0)->get();
searchApi.search({"index":"products","query":{"match_all":{},"range":{"c":{"gt":0}}}},"expressions":{"c":"regex(data.name, 'est')"}})
res = await searchApi.search({"index":"products","query":{"match_all":{},"range":{"c":{"gt":0}}}},"expressions":{"c":"regex(data.name, 'est')"}});
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
query.put("range", new HashMap<String,Object>(){{
put("c", new HashMap<String,Object>(){{
put("gt",0);
}});
}});
searchRequest.setQuery(query);
Object expressions = new HashMap<String,Object>(){{
put("idx","indexof(x>2 for x in data.intarray)");
}};
searchRequest.setExpressions(expressions);
searchResponse = searchApi.search(searchRequest);
object query = new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
var rangeFilter = new RangeFilter("c");
rangeFilter.Gt = 0;
searchRequest.AttrFilter = rangeFilter;
searchRequest.Expressions = new List<Object> {
new Dictionary<string, string> { {"idx", "indexof(x>2 for x in data.intarray)"} }
};
var searchResponse = searchApi.Search(searchRequest);
In the case of JSON properties, enforcing data type may be required for proper functionality in certain situations. For example, when working with float values, DOUBLE() must be used for proper sorting.
select * from products order by double(data.myfloat) desc
POST /search
{
"index": "products",
"query": { "match_all": {} } },
"sort": [ { "double(data.myfloat)": { "order": "desc"} } ]
}
$index->setName('products')->search('')->sort('double(data.myfloat)','desc')->get();
searchApi.search({"index":"products","query":{"match_all":{}}},"sort":[{"double(data.myfloat)":{"order":"desc"}}]})
res = await searchApi.search({"index":"products","query":{"match_all":{}}},"sort":[{"double(data.myfloat)":{"order":"desc"}}]});
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
searchRequest.setSort(new ArrayList<Object>(){{
add(new HashMap<String,String>(){{ put("double(data.myfloat)",new HashMap<String,String>(){{ put("order","desc");}});}});
}});
searchResponse = searchApi.search(searchRequest);
object query = new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Sort = new List<Object> {
new SortOrder("double(data.myfloat)", SortOrder.OrderEnum.Desc)
};
var searchResponse = searchApi.Search(searchRequest);
Float vector attributes allow storing variable-length lists of floats. It's important to note that this concept differs from multi-valued attributes. Multi-valued attributes (MVAs) are essentially sets; they do not preserve value order, and duplicate values are not retained. In contrast, float vectors perform no additional processing on values during insertion.
Float vector attributes can be used in k-nearest neighbor searches; see KNN search.
Note: currently, float_vector fields can only be used in KNN search within real-time tables; the data type is not supported in any other functions or expressions, nor is it supported in plain tables.
CREATE TABLE products(title text, image_vector float_vector);
POST /cli -d "CREATE TABLE products(title text, image_vector float_vector)"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'image_vector'=>['type'=>'float_vector']
]);
utilsApi.sql('CREATE TABLE products(title text, image_vector float_vector)')
res = await utilsApi.sql('CREATE TABLE products(title text, image_vector float_vector)');
utilsApi.sql("CREATE TABLE products(title text, image_vector float_vector)");
utilsApi.Sql("CREATE TABLE products(title text, image_vector float_vector)");
table products
{
type = rt
path = products
rt_field = title
stored_fields = title
rt_attr_float_vector = image_vector
}
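As a rough sketch of how a float vector might be created and queried (the knn_type, knn_dims and hnsw_similarity table options and the knn() clause are covered in the KNN search section; the vector values below are made up):
-- KNN options are required at creation time for vector search
CREATE TABLE products(title text, image_vector float_vector knn_type='hnsw' knn_dims='4' hnsw_similarity='L2');
insert into products values (1, 'yellow bag', (0.653448,0.192478,0.017971,0.339821));
-- return the 5 nearest neighbours of the given query vector
select id, knn_dist() from products where knn(image_vector, 5, (0.286569,-0.031816,0.066684,0.032926));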
Multi-value attributes allow storing variable-length lists of 32-bit unsigned integers. This can be useful for storing one-to-many numeric values, such as tags, product categories, and properties.
CREATE TABLE products(title text, product_codes multi);
POST /cli -d "CREATE TABLE products(title text, product_codes multi)"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'product_codes'=>['type'=>'multi']
]);
utilsApi.sql('CREATE TABLE products(title text, product_codes multi)')
res = await utilsApi.sql('CREATE TABLE products(title text, product_codes multi)');
utilsApi.sql("CREATE TABLE products(title text, product_codes multi)");
utilsApi.Sql("CREATE TABLE products(title text, product_codes multi)");
table products
{
type = rt
path = products
rt_field = title
stored_fields = title
rt_attr_multi = product_codes
}
Multi-value attributes support filtering and aggregation, but not direct sorting. Filtering can be done using a condition that requires at least one element to pass (using ANY()) or all elements to pass (using ALL()).
select * from products where any(product_codes)=3
POST /search
{
"index": "products",
"query":
{
"match_all": {},
"equals" : { "any(product_codes)": 3 }
}
}
$index->setName('products')->search('')->filter('any(product_codes)','equals',3)->get();
searchApi.search({"index":"products","query":{"match_all":{},"equals":{"any(product_codes)":3}}}})
res = await searchApi.search({"index":"products","query":{"match_all":{},"equals":{"any(product_codes)":3}}}})'
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
query.put("equals",new HashMap<String,Integer>(){{
put("any(product_codes)",3);
}});
searchRequest.setQuery(query);
searchResponse = searchApi.search(searchRequest);
object query = new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.AttrFilter = new EqualsFilter("any(product_codes)", 3);
var searchResponse = searchApi.Search(searchRequest);
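The ALL() variant mentioned above works the same way but requires every element of the list to satisfy the condition; a minimal sketch:
-- matches documents where every value in product_codes differs from 3
select * from products where all(product_codes)!=3;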
The least or greatest element and the length of the list can also be extracted. The following example shows ordering by the least element of a multi-value attribute.
select least(product_codes) l from products order by l asc
POST /search
{
"index": "products",
"query": { "match_all": {} },
"sort": [ { "product_codes":{ "order":"asc", "mode":"min" } } ]
}
$index->setName('products')->search('')->sort('product_codes','asc','min')->get();
searchApi.search({"index":"products","query":{"match_all":{},"sort":[{"product_codes":{"order":"asc","mode":"min"}}]}})
res = await searchApi.search({"index":"products","query":{"match_all":{},"sort":[{"product_codes":{"order":"asc","mode":"min"}}]}});
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
searchRequest.setSort(new ArrayList<Object>(){{
add(new HashMap<String,String>(){{ put("product_codes",new HashMap<String,String>(){{ put("order","asc");put("mode","min");}});}});
}});
searchResponse = searchApi.search(searchRequest);
object query = new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Sort = new List<Object> {
new SortMVA("product_codes", SortOrder.OrderEnum.Asc, SortMVA.ModeEnum.Min)
};
var searchResponse = searchApi.Search(searchRequest);
When grouping by a multi-value attribute, a document will contribute to as many groups as there are different values associated with that document. For instance, if a collection contains exactly one document having a 'product_codes' multi-value attribute with values 5, 7, and 11, grouping on 'product_codes' will produce 3 groups with COUNT(*) equal to 1 and GROUPBY() key values of 5, 7, and 11, respectively. Also, note that grouping by multi-value attributes may lead to duplicate documents in the result set because each document can participate in many groups.
insert into products values ( 1, 'doc one', (5,7,11) );
select id, count(*), groupby() from products group by product_codes;
Query OK, 1 row affected (0.00 sec)
+------+----------+-----------+
| id | count(*) | groupby() |
+------+----------+-----------+
| 1 | 1 | 11 |
| 1 | 1 | 7 |
| 1 | 1 | 5 |
+------+----------+-----------+
3 rows in set (0.00 sec)
The order of the numbers inserted as values of multivalued attributes is not preserved. Values are stored internally as a sorted set.
insert into products values (1,'first',(4,2,1,3));
select * from products;
Query OK, 1 row affected (0.00 sec)
+------+---------------+-------+
| id | product_codes | title |
+------+---------------+-------+
| 1 | 1,2,3,4 | first |
+------+---------------+-------+
1 row in set (0.01 sec)
POST /insert
{
"index":"products",
"id":1,
"doc":
{
"title":"first",
"product_codes":[4,2,1,3]
}
}
POST /search
{
"index": "products",
"query": { "match_all": {} }
}
{
"_index":"products",
"_id":1,
"created":true,
"result":"created",
"status":201
}
{
"took":0,
"timed_out":false,
"hits":{
"total":1,
"hits":[
{
"_id":"1",
"_score":1,
"_source":{
"product_codes":[
1,
2,
3,
4
],
"title":"first"
}
}
]
}
}
$index->addDocument([
"title"=>"first",
"product_codes"=>[4,2,1,3]
]);
$index->search('')->get();
Array
(
[_index] => products
[_id] => 1
[created] => 1
[result] => created
[status] => 201
)
Array
(
[took] => 0
[timed_out] =>
[hits] => Array
(
[total] => 1
[hits] => Array
(
[0] => Array
(
[_id] => 1
[_score] => 1
[_source] => Array
(
[product_codes] => Array
(
[0] => 1
[1] => 2
[2] => 3
[3] => 4
)
[title] => first
)
)
)
)
)
indexApi.insert({"index":"products","id":1,"doc":{"title":"first","product_codes":[4,2,1,3]}})
searchApi.search({"index":"products","query":{"match_all":{}}})
{'created': True,
'found': None,
'id': 1,
'index': 'products',
'result': 'created'}
{'hits': {'hits': [{u'_id': u'1',
u'_score': 1,
u'_source': {u'product_codes': [1, 2, 3, 4],
u'title': u'first'}}],
'total': 1},
'profile': None,
'timed_out': False,
'took': 29}
await indexApi.insert({"index":"products","id":1,"doc":{"title":"first","product_codes":[4,2,1,3]}});
res = await searchApi.search({"index":"products","query":{"match_all":{}}});
{"took":0,"timed_out":false,"hits":{"total":1,"hits":[{"_id":"1","_score":1,"_source":{"product_codes":[1,2,3,4],"title":"first"}}]}}
InsertDocumentRequest newdoc = new InsertDocumentRequest();
HashMap<String,Object> doc = new HashMap<String,Object>(){{
put("title","first");
put("product_codes",new int[] {4,2,1,3});
}};
newdoc.index("products").id(1L).setDoc(doc);
sqlresult = indexApi.insert(newdoc);
Map<String,Object> query = new HashMap<String,Object>();
query.put("match_all",null);
SearchRequest searchRequest = new SearchRequest();
searchRequest.setIndex("products");
searchRequest.setQuery(query);
SearchResponse searchResponse = searchApi.search(searchRequest);
System.out.println(searchResponse.toString() );
class SearchResponse {
took: 0
timedOut: false
hits: class SearchResponseHits {
total: 1
hits: [{_id=1, _score=1, _source={product_codes=[1, 2, 3, 4], title=first}}]
aggregations: null
}
profile: null
}
Dictionary<string, Object> doc = new Dictionary<string, Object>();
doc.Add("title", "first");
doc.Add("product_codes", new List<Object> {4,2,1,3});
InsertDocumentRequest newdoc = new InsertDocumentRequest(index: "products", id: 1, doc: doc);
var sqlresult = indexApi.Insert(newdoc);
object query = new { match_all=null };
var searchRequest = new SearchRequest("products", query);
var searchResponse = searchApi.Search(searchRequest);
Console.WriteLine(searchResponse.ToString());
class SearchResponse {
took: 0
timedOut: false
hits: class SearchResponseHits {
total: 1
hits: [{_id=1, _score=1, _source={product_codes=[1, 2, 3, 4], title=first}}]
aggregations: null
}
profile: null
}
The multi64 data type allows storing variable-length lists of 64-bit signed integers. Otherwise, it has the same functionality as the multi-value integer type.
CREATE TABLE products(title text, values multi64);
POST /cli -d "CREATE TABLE products(title text, values multi64)"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'values'=>['type'=>'multi64']
]);
utilsApi.sql('CREATE TABLE products(title text, values multi64)')
res = await utilsApi.sql('CREATE TABLE products(title text, values multi64)');
utilsApi.sql("CREATE TABLE products(title text, values multi64)");
utilsApi.Sql("CREATE TABLE products(title text, values multi64)");
table products
{
type = rt
path = products
rt_field = title
stored_fields = title
rt_attr_multi_64 = values
}
When you use the columnar storage you can specify the following properties for the attributes.
By default, Manticore Columnar storage stores all attributes in a columnar fashion, as well as in a special docstore row by row. This enables fast execution of queries like SELECT * FROM ..., especially when fetching a large number of records at once. However, if you are sure that you do not need it or wish to save disk space, you can disable it by specifying fast_fetch='0' when creating a table or (if you are defining a table in a config) by using columnar_no_fast_fetch as shown in the following example.
create table t(a int, b int fast_fetch='0') engine='columnar'; desc t;
+-------+--------+---------------------+
| Field | Type | Properties |
+-------+--------+---------------------+
| id | bigint | columnar fast_fetch |
| a | uint | columnar fast_fetch |
| b | uint | columnar |
+-------+--------+---------------------+
3 rows in set (0.00 sec)
source min {
type = mysql
sql_host = localhost
sql_user = test
sql_pass =
sql_db = test
sql_query = select 1, 1 a, 1 b
sql_attr_uint = a
sql_attr_uint = b
}
table tbl {
path = tbl/col
source = min
columnar_attrs = *
columnar_no_fast_fetch = b
}
+-------+--------+---------------------+
| Field | Type | Properties |
+-------+--------+---------------------+
| id | bigint | columnar fast_fetch |
| a | uint | columnar fast_fetch |
| b | uint | columnar |
+-------+--------+---------------------+
In Manticore Search, there are two ways to manage tables: the real-time (RT) mode and the plain mode.
Real-time mode requires no table definition in the configuration file. However, the data_dir directive in the searchd section is mandatory. Index files are stored inside the data_dir.
Replication is only available in this mode.
You can use SQL commands such as CREATE TABLE, ALTER TABLE and DROP TABLE to create and modify table schema, and to drop it. This mode is particularly useful for real-time and percolate tables.
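For example, a minimal sketch of managing a table online in RT mode (hypothetical schema):
-- create a table, change its schema online, then drop it
CREATE TABLE tbl(title text, price float);
ALTER TABLE tbl ADD COLUMN category int;
DROP TABLE tbl;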
Table names are converted to lowercase when created.
In the plain mode, you can specify the table schema in the configuration file. Manticore reads this schema on startup and creates the table if it doesn't exist yet. This mode is particularly useful for plain tables that use data from an external storage.
To drop a table, remove it from the configuration file or remove the path setting and send a HUP signal to the server or restart it.
Table names are case-sensitive in this mode.
All table types are supported in this mode.
| Table type | RT mode | Plain mode |
|---|---|---|
| Real-time | supported | supported |
| Plain | not supported | supported |
| Percolate | supported | supported |
| Distributed | supported | supported |
| Template | not supported | supported |
A real-time table is the main type of table in Manticore. It lets you add, update, and delete documents, and you can see these changes right away. You can set up a real-time table in a configuration file or use commands like CREATE, UPDATE, DELETE, or ALTER.
Internally, a real-time table consists of one or more plain tables called chunks. There are two kinds of chunks: the RAM chunk and disk chunks.
The size of the RAM chunk is controlled by the rt_mem_limit setting. Once this limit is reached, the RAM chunk is transferred to disk as a disk chunk. If there are too many disk chunks, Manticore combines some of them to improve performance.
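As a sketch of the knobs mentioned above (rt_mem_limit is a table option, and OPTIMIZE TABLE triggers merging of disk chunks; the table name is hypothetical):
-- allow the RAM chunk to grow to 256 MB before it is flushed into a disk chunk
CREATE TABLE products(title text, price float) rt_mem_limit = '256M';
-- merge disk chunks to keep their number under control
OPTIMIZE TABLE products;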
You can create a new real-time table in two ways: by using the CREATE TABLE command or through the _mapping endpoint of the HTTP JSON API.
You can use this command via both SQL and HTTP protocols:
CREATE TABLE products(title text, price float) morphology='stem_en';
Query OK, 0 rows affected (0.00 sec)
POST /cli -d "CREATE TABLE products(title text, price float) morphology='stem_en'"
{
"total":0,
"error":"",
"warning":""
}
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float'],
]);
utilsApi.sql('CREATE TABLE forum(title text, price float)')
res = await utilsApi.sql('CREATE TABLE forum(title text, price float)');
utilsApi.sql("CREATE TABLE forum(title text, price float)");
utilsApi.Sql("CREATE TABLE forum(title text, price float)");
table products {
type = rt
path = tbl
rt_field = title
rt_attr_float = price
stored_fields = title
}
Alternatively, you can create a new table via the _mapping endpoint. This endpoint allows you to define an Elasticsearch-like table structure to be converted to a Manticore table.
The body of your request must have the following structure:
"properties"
{
"FIELD_NAME_1":
{
"type": "FIELD_TYPE_1"
},
"FIELD_NAME_2":
{
"type": "FIELD_TYPE_2"
},
...
"FIELD_NAME_N":
{
"type": "FIELD_TYPE_M"
}
}
When creating a table, Elasticsearch data types will be mapped to Manticore types according to the following rules:
- aggregate_metric => json
- binary => string
- boolean => bool
- byte => int
- completion => string
- date => timestamp
- date_nanos => bigint
- date_range => json
- dense_vector => json
- flattened => json
- flat_object => json
- float => float
- float_range => json
- geo_point => json
- geo_shape => json
- half_float => float
- histogram => json
- integer => int
- integer_range => json
- ip => string
- ip_range => json
- keyword => string
- knn_vector => float_vector
- long => bigint
- long_range => json
- match_only_text => text
- object => json
- point => json
- scaled_float => float
- search_as_you_type => text
- shape => json
- short => int
- text => text
- unsigned_long => int
- version => string
POST /your_table_name/_mapping -d '
{
"test": {
"mappings": {
"properties": {
"price": {
"type": "float"
},
"title": {
"type": "text"
}
}
}
}
}
'
{
"total":0,
"error":"",
"warning":""
}
The schema of a real-time table can be changed online using the ALTER command, as explained in Change schema online. The following table outlines the different file extensions used in a real-time table and their respective descriptions:
| Extension | Description |
|---|---|
| .lock | A lock file that ensures that only one process can access the table at a time. |
| .ram | The RAM chunk of the table, stored in memory and used as an accumulator of changes. |
| .meta | The headers of the real-time table that define its structure and settings. |
| .*.sp* | Disk chunks that are stored on disk with the same format as plain tables. They are created when the RAM chunk size exceeds the rt_mem_limit. |
For more information on the structure of disk chunks, refer to the plain table files structure.
A plain table is the basic element for non-percolate searching. It can be defined only in a configuration file using the plain mode and is not supported in the RT mode. It is typically used in conjunction with a source to process data from an external storage and can later be attached to a real-time table.
To create a plain table, you'll need to define it in a configuration file. It's not supported by the CREATE TABLE command.
Here's an example of a plain table configuration and a source for fetching data from a MySQL database:
source source {
type = mysql
sql_host = localhost
sql_user = myuser
sql_pass = mypass
sql_db = mydb
sql_query = SELECT id, title, description, category_id from mytable
sql_attr_uint = category_id
sql_field_string = title
}
table tbl {
type = plain
source = source
path = /path/to/table
}
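Once built, such a plain table can later be attached to a real-time table, as mentioned above. A minimal sketch, assuming the ATTACH TABLE statement described in the section on attaching tables and a pre-existing real-time table rt_tbl:
-- moves the data of the plain table 'tbl' into the real-time table 'rt_tbl'
ATTACH TABLE tbl TO TABLE rt_tbl;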
Numeric attributes, including MVAs, are the only elements that can be updated in a plain table. All other data in the table is immutable. If updates or new records are required, the table must be rebuilt. During the rebuilding process, the existing table remains available to serve requests, and a process called rotation is performed when the new version is ready, bringing it online and discarding the old version.
The speed at which a plain table is indexed depends on several factors, such as how quickly the data source can deliver the data, the tokenization settings, and the hardware.
For small data sets, the simplest option is to have a single plain table that is fully rebuilt as needed. This approach is acceptable as long as the full rebuild is fast enough and some delay in propagating updates is acceptable.
For larger data sets, a plain table can be used instead of a real-time table. The main+delta scenario involves a large main table that is rebuilt infrequently and a smaller delta table that indexes only the recent changes, with the two searched together.
This approach allows for infrequent rebuilding of the larger table and more frequent processing of updates from the source. The smaller table can be rebuilt more often (e.g. every minute or even every few seconds).
However, as time goes on, the indexing duration for the smaller table will become too long, requiring a rebuild of the larger table and the emptying of the smaller one.
The main+delta schema is explained in detail in this interactive course.
The kill list mechanism and the killlist_target directive are used to ensure that documents from the current table take precedence over those from the other table.
For more information on this topic, see here.
The following table outlines the various file extensions used in a plain table and their respective descriptions:
| Extension | Description |
|---|---|
| .spa | stores document attributes in row-wise mode |
| .spb | stores blob attributes in row-wise mode: strings, MVA, json |
| .spc | stores document attributes in columnar mode |
| .spd | stores matching document ID lists for each word ID |
| .sph | stores table header information |
| .sphi | stores histograms of attribute values |
| .spi | stores word lists (word IDs and pointers to .spd file) |
| .spidx | stores secondary indexes data |
| .spk | stores kill-lists |
| .spl | lock file |
| .spm | stores a bitmap of killed documents |
| .spp | stores hit (aka posting, aka word occurrence) lists for each word ID |
| .spt | stores additional data structures to speed up lookups by document ids |
| .spe | stores skip-lists to speed up doc-list filtering |
| .spds | stores document texts |
| .tmp* | temporary files created during indexing |
| .new.sp* | new version of a plain table before rotation |
| .old.sp* | old version of a plain table after rotation |
table <index_name>[:<parent table name>] {
...
}
table <table name> {
type = plain
path = /path/to/table
source = <source_name>
source = <another source_name>
[stored_fields = <comma separated list of full-text fields that should be stored, all are stored by default, can be empty>]
}
table <table name> {
type = rt
path = /path/to/table
rt_field = <full-text field name>
rt_field = <another full-text field name>
[rt_attr_uint = <integer field name>]
[rt_attr_uint = <another integer field name, limit by N bits>:N]
[rt_attr_bigint = <bigint field name>]
[rt_attr_bigint = <another bigint field name>]
[rt_attr_multi = <multi-integer (MVA) field name>]
[rt_attr_multi = <another multi-integer (MVA) field name>]
[rt_attr_multi_64 = <multi-bigint (MVA) field name>]
[rt_attr_multi_64 = <another multi-bigint (MVA) field name>]
[rt_attr_float = <float field name>]
[rt_attr_float = <another float field name>]
[rt_attr_float_vector = <float vector field name>]
[rt_attr_float_vector = <another float vector field name>]
[rt_attr_bool = <boolean field name>]
[rt_attr_bool = <another boolean field name>]
[rt_attr_string = <string field name>]
[rt_attr_string = <another string field name>]
[rt_attr_json = <json field name>]
[rt_attr_json = <another json field name>]
[rt_attr_timestamp = <timestamp field name>]
[rt_attr_timestamp = <another timestamp field name>]
[stored_fields = <comma separated list of full-text fields that should be stored, all are stored by default, can be empty>]
[rt_mem_limit = <RAM chunk max size, default 128M>]
[optimize_cutoff = <max number of RT table disk chunks>]
}
type = plain
type = rt
Table type: "plain" or "rt" (real-time)
Value: plain (default), rt
path = path/to/table
The path to where the table will be stored or located, either absolute or relative, without the extension.
Value: The path to the table, mandatory
stored_fields = title, content
By default, the original content of full-text fields is indexed and stored when a table is defined in a configuration file. This setting allows you to specify the fields that should have their original values stored.
Value: A comma-separated list of full-text fields that should be stored. An empty value (i.e. stored_fields = ) disables the storage of original values for all fields.
Note: In the case of a real-time table, the fields listed in stored_fields should also be declared as rt_field.
Also, note that you don't need to list attributes in stored_fields, since their original values are stored anyway. stored_fields can only be used for full-text fields.
See also docstore_block_size, docstore_compression for document storage compression options.
CREATE TABLE products(title text, content text stored indexed, name text indexed, price float)
POST /cli -d "
CREATE TABLE products(title text, content text stored indexed, name text indexed, price float)"
$params = [
'body' => [
'columns' => [
'title'=>['type'=>'text'],
'content'=>['type'=>'text', 'options' => ['indexed', 'stored']],
'name'=>['type'=>'text', 'options' => ['indexed']],
'price'=>['type'=>'float']
]
],
'index' => 'products'
];
$index = new \Manticoresearch\Index($client);
$index->create($params);
utilsApi.sql('CREATE TABLE products(title text, content text stored indexed, name text indexed, price float)')
res = await utilsApi.sql('CREATE TABLE products(title text, content text stored indexed, name text indexed, price float)');
utilsApi.sql("CREATE TABLE products(title text, content text stored indexed, name text indexed, price float)");
utilsApi.Sql("CREATE TABLE products(title text, content text stored indexed, name text indexed, price float)");
table products {
stored_fields = title, content # we want to store only "title" and "content", "name" shouldn't be stored
type = rt
path = tbl
rt_field = title
rt_field = content
rt_field = name
rt_attr_uint = price
}
stored_only_fields = title,content
List of fields that will be stored in the table but not indexed. This setting works similarly to stored_fields, except that when a field is specified in stored_only_fields, it will only be stored, not indexed, and cannot be searched using full-text queries. It can only be retrieved in search results.
The value is a comma-separated list of fields that should be stored only, not indexed. By default, this value is empty. If a real-time table is being defined, the fields listed in stored_only_fields must also be declared as rt_field.
Also note that you don't need to list attributes in stored_only_fields, since their original values are stored anyway. Comparing stored_only_fields to string attributes, the former (stored field):
In contrast, the latter (string attribute):
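As a minimal sketch (the field and table names are illustrative), a real-time table that keeps the raw content of a field retrievable but not searchable could be configured like this:

table products {
type = rt
path = tbl
rt_field = title
# "raw_data" is declared as a field and listed in stored_only_fields,
# so it is kept in document storage and returned in search results,
# but it is not full-text indexed and cannot be searched
rt_field = raw_data
stored_only_fields = raw_data
rt_attr_uint = price
}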
optimize_cutoff determines the maximum number of disk chunks for the RT table. Learn more here.
rt_field = subject
This field declaration determines the full-text fields that will be indexed. The field names must be unique, and the order is preserved. When inserting data, the field values must be in the same order as specified in the configuration.
This is a multi-value, optional field.
rt_attr_uint = gid
This declaration defines an unsigned integer attribute.
Value: the field name or field_name:N (where N is the maximum number of bits to keep).
rt_attr_bigint = gid
This declaration defines a BIGINT attribute.
Value: field name, multiple records allowed.
rt_attr_multi = tags
Declares a multi-valued attribute (MVA) with unsigned 32-bit integer values.
Value: field name. Multiple records allowed.
rt_attr_multi_64 = wide_tags
Declares a multi-valued attribute (MVA) with signed 64-bit BIGINT values.
Value: field name. Multiple records allowed.
rt_attr_float = lat
rt_attr_float = lon
Declares floating point attributes with single precision, 32-bit IEEE 754 format.
Value: field name. Multiple records allowed.
rt_attr_float_vector = image_vector
Declares a vector of floating-point values.
Value: field name. Multiple records allowed.
rt_attr_bool = available
Declares a boolean attribute with 1-bit unsigned integer values.
Value: field name.
rt_attr_string = title
String attribute declaration.
Value: field name.
rt_attr_json = properties
Declares a JSON attribute.
Value: field name.
rt_attr_timestamp = date_added
Declares a timestamp attribute.
Value: field name.
rt_mem_limit = 512M
Memory limit for a RAM chunk of the table. Optional, default is 128M.
RT tables store some data in memory, known as the "RAM chunk," and also maintain a number of on-disk tables, referred to as "disk chunks." This directive allows you to control the size of the RAM chunk. When there is too much data to keep in memory, RT tables will flush it to disk, activate a newly created disk chunk, and reset the RAM chunk.
Please note that the limit is strict, and RT tables will never allocate more memory than what is specified in the rt_mem_limit. Additionally, memory is not preallocated, so specifying a 512MB limit and only inserting 3MB of data will result in allocating only 3MB, not 512MB.
The rt_mem_limit is never exceeded, but the actual RAM chunk size can be significantly lower than the limit. RT tables adapt to the data insertion pace and adjust the actual limit dynamically to minimize memory usage and maximize data write speed. This is how it works:
- A rate is applied to rt_mem_limit (50% of rt_mem_limit by default), referred to as the "rt_mem_limit rate".
- Once the RAM chunk accumulates rt_mem_limit * rate of data, Manticore starts saving the RAM chunk as a new disk chunk.
- While the disk chunk is being saved, Manticore keeps track of how much new data arrives.
- After the new disk chunk has been saved, the rt_mem_limit rate is updated.

For instance, if 90MB of data is saved to a disk chunk and an additional 10MB of data arrives while the save is in progress, the rate would be 90%. Next time, the RT table will collect up to 90% of rt_mem_limit before flushing the data. The faster the insertion pace, the lower the rt_mem_limit rate. The rate varies between 33.3% and 95%. You can view the current rate of a table using the SHOW TABLE <name> STATUS command.
In real-time mode, you can adjust the size limit of RAM chunks and the maximum number of disk chunks using the ALTER TABLE statement. To set rt_mem_limit to 1 gigabyte for the table "t," run the following query: ALTER TABLE t rt_mem_limit='1G'. To change the maximum number of disk chunks, run the query: ALTER TABLE t optimize_cutoff='5'.
In the plain mode, you can change the values of rt_mem_limit and optimize_cutoff by updating the table configuration or running the command ALTER TABLE <index_name> RECONFIGURE.

Things to keep in mind about the RAM chunk and rt_mem_limit:
- The RAM chunk is saved to the .ram file and, when the RAM chunk is full, it is dumped to disk as a disk chunk.
- A larger rt_mem_limit setting will increase the time it takes to replay the binary log and recover the RAM chunk.
- Regardless of rt_mem_limit, Manticore may take up more memory in some cases, such as when you start a transaction to insert data and don't commit it for a while. In this case, the data you have already transmitted within the transaction will remain in memory.

source = srcpart1
source = srcpart2
source = srcpart3
The source field specifies the source from which documents will be obtained during indexing of the current table. There must be at least one source. The sources can be of different types (e.g., one could be MySQL, another PostgreSQL). For more information on indexing from external storages, see here.
Value: The name of the source is mandatory. Multiple values are allowed.
killlist_target = main:kl
This setting determines the table(s) to which the kill-list will be applied. Matches in the targeted table that are updated or deleted in the current table will be suppressed. In :kl mode, the documents to suppress are taken from the kill-list. In :id mode, all document IDs from the current table are suppressed in the targeted one. If neither is specified, both modes will take effect. Learn more about kill-lists here
Value: not specified (default), target_index_name:kl, target_index_name:id, target_index_name. Multiple values are allowed
columnar_attrs = *
columnar_attrs = id, attr1, attr2, attr3
This configuration setting determines which attributes should be stored in the columnar storage instead of the row-wise storage.
You can set columnar_attrs = * to store all supported data types in the columnar storage.
Additionally, id is a supported attribute to store in the columnar storage.
columnar_strings_no_hash = attr1, attr2, attr3
By default, all string attributes kept in columnar storage have pre-calculated hashes stored alongside them. These hashes are used for grouping and filtering. However, they occupy extra space, and if you don't need to group by that attribute, you can save space by disabling hash generation.
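For illustration, here is a sketch of a table that keeps a few attributes in columnar storage and skips hash generation for a string attribute that is never grouped on (the attribute and source names are assumptions):

table products {
type = plain
source = src_products
path = /path/to/products
# store these attributes in columnar storage instead of row-wise storage
columnar_attrs = price, category_id, brand
# "brand" is never used in GROUP BY, so skip its pre-calculated hashes to save space
columnar_strings_no_hash = brand
}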
CREATE TABLE [IF NOT EXISTS] name ( <field name> <field data type> [data type options] [, ...]) [table_options]
For more information on data types, see here.
| Type | Equivalent in a configuration file | Notes | Aliases |
|---|---|---|---|
| text | rt_field | Options: indexed, stored. Default: both. To keep text stored, but not indexed, specify "stored" only. To keep text indexed only, specify "indexed" only. | string |
| integer | rt_attr_uint | integer | int, uint |
| bigint | rt_attr_bigint | big integer | |
| float | rt_attr_float | float | |
| float_vector | rt_attr_float_vector | a vector of float values | |
| multi | rt_attr_multi | multi-integer | |
| multi64 | rt_attr_multi_64 | multi-bigint | |
| bool | rt_attr_bool | boolean | |
| json | rt_attr_json | JSON | |
| string | rt_attr_string | string. The indexed option makes the value full-text indexed as well as filterable, sortable, and groupable at the same time | |
| timestamp | rt_attr_timestamp | timestamp | |
| bit(n) | rt_attr_uint field_name:N | N is the max number of bits to keep | |
CREATE TABLE products (title text, price float) morphology='stem_en'
CREATE TABLE products (title text indexed, description text stored, author text, price float)
create table ... engine='columnar';
create table ... engine='rowwise';
The engine setting changes the default attribute storage for all attributes in the table. You can also specify engine separately for each attribute.
For information on how to enable columnar storage for a plain table, see columnar_attrs.
Values: rowwise (default), columnar.
The following settings are applicable for both real-time and plain tables, regardless of whether they are specified in a configuration file or set online using the CREATE or ALTER command.
Manticore supports two access modes for reading table data: seek+read and mmap.
In seek+read mode, the server uses the pread system call to read document lists and keyword positions, represented by the *.spd and *.spp files. The server uses internal read buffers to optimize the reading process, and the size of these buffers can be adjusted using the options read_buffer_docs and read_buffer_hits. There is also the option preopen that controls how Manticore opens files at start.
In mmap access mode, the search server maps the table's file into memory using the mmap system call, and the OS caches the file contents. The options read_buffer_docs and read_buffer_hits have no effect for corresponding files in this mode. The mmap reader can also lock the table's data in memory using the mlock privileged call, which prevents the OS from swapping the cached data out to disk.
To control which access mode to use, the options access_plain_attrs, access_blob_attrs, access_doclists, access_hitlists and access_dict are available, with the following values:
| Value | Description |
|---|---|
| file | server reads the table files from disk with seek+read using internal buffers on file access |
| mmap | server maps the table files into memory and OS caches up its contents on file access |
| mmap_preread | server maps the table files into memory and a background thread reads it once to warm up the cache |
| mlock | server maps the table files into memory and then executes the mlock() system call to cache up the file contents and lock it into memory to prevent it being swapped out |
| Setting | Values | Description |
|---|---|---|
| access_plain_attrs | mmap, mmap_preread (default), mlock | controls how *.spa (plain attributes) *.spe (skip lists) *.spt (lookups) *.spm (killed docs) will be read |
| access_blob_attrs | mmap, mmap_preread (default), mlock | controls how *.spb (blob attributes) (string, mva and json attributes) will be read |
| access_doclists | file (default), mmap, mlock | controls how *.spd (doc lists) data will be read |
| access_hitlists | file (default), mmap, mlock | controls how *.spp (hit lists) data will be read |
| access_dict | mmap, mmap_preread (default), mlock | controls how *.spi (dictionary) will be read |
Here is a table which can help you select your desired mode:
| table part | keep it on disk | keep it in memory | cached in memory on server start | lock it in memory |
|---|---|---|---|---|
| plain attributes in row-wise (non-columnar) storage, skip lists, word lists, lookups, killed docs | mmap | mmap | mmap_preread (default) | mlock |
| row-wise string, multi-value attributes (MVA) and json attributes | mmap | mmap | mmap_preread (default) | mlock |
| columnar numeric, string and multi-value attributes | always | only by means of OS | no | not supported |
| doc lists | file (default) | mmap | no | mlock |
| hit lists | file (default) | mmap | no | mlock |
| dictionary | mmap | mmap | mmap_preread (default) | mlock |
Recommendations:
- For the fastest search response time and ample memory availability, lock attributes in memory using mlock. Additionally, use mlock for doclists/hitlists.
- If you prefer faster searchd restarts, stick to the default mmap_preread option.
- If you want to conserve memory while still having enough memory for all attributes, skip mlock. The operating system will determine what should be kept in memory based on frequent disk reads.
- If full-text search performance is not a major concern and you wish to save memory, use access_doclists/access_hitlists=file.

The default mode offers a balance of decent search performance, optimal memory utilization, and faster searchd restart in most scenarios.
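As an illustration only (not a recommendation for every workload), a table tuned for low search latency at the cost of higher memory usage might combine the access options like this; the table and source names are assumptions:

table products {
type = plain
source = src_products
path = /path/to/products
# lock row-wise and blob attributes in memory so they are never swapped out
access_plain_attrs = mlock
access_blob_attrs = mlock
# also lock doc lists and hit lists in memory instead of seek+read from disk
access_doclists = mlock
access_hitlists = mlock
}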
attr_update_reserve = 256k
This setting reserves extra space for updates to blob attributes such as multi-value attributes (MVA), strings, and JSON. The default value is 128k. When updating these attributes, their length may change. If the updated string is shorter than the previous one, it will overwrite the old data in the *.spb file. If the updated string is longer, it will be written to the end of the *.spb file. This file is memory-mapped, making resizing it a potentially slow process, depending on the operating system's memory-mapped file implementation. To avoid frequent resizing, you can use this setting to reserve extra space at the end of the .spb file.
Value: size, default 128k.
docstore_block_size = 32k
This setting controls the size of blocks used by the document storage. The default value is 16kb. When original document text is stored using stored_fields or stored_only_fields, it is stored within the table and compressed for efficiency. To optimize disk access and compression ratios for small documents, these documents are concatenated into blocks. The indexing process collects documents until their total size reaches the threshold specified by this option. At that point, the block of documents is compressed. This option can be adjusted to achieve better compression ratios (by increasing the block size) or faster access to document text (by decreasing the block size).
Value: size, default 16k.
docstore_compression = lz4hc
This setting determines the type of compression used for compressing blocks of documents stored in document storage. If stored_fields or stored_only_fields are specified, the document storage stores compressed document blocks. 'lz4' offers fast compression and decompression speeds, while 'lz4hc' (high compression) sacrifices some compression speed for a better compression ratio. 'none' disables compression completely.
Values: lz4 (default), lz4hc, none.
docstore_compression_level = 12
The compression level used when 'lz4hc' compression is applied in document storage. By adjusting the compression level, you can find the right balance between performance and compression ratio when using 'lz4hc' compression. Note that this option is not applicable when using 'lz4' compression.
Value: An integer between 1 and 12, with a default of 9.
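A sketch combining the document storage options discussed above (the values are illustrative and the table and field names are assumptions):

table articles {
type = rt
path = articles
rt_field = title
rt_field = content
# keep the original text of both fields in the document storage
stored_fields = title, content
rt_attr_uint = published
# larger blocks generally compress better but are slower to read back
docstore_block_size = 32k
# lz4hc gives a better compression ratio at the cost of compression speed
docstore_compression = lz4hc
docstore_compression_level = 12
}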
preopen = 1
This setting indicates that searchd should open all table files on startup or rotation, and keep them open while running. By default, the files are not pre-opened. Pre-opened tables require a few file descriptors per table, but they eliminate the need for per-query open() calls and are immune to race conditions that might occur during table rotation under high load. However, if you are serving many tables, it may still be more efficient to open them on a per-query basis in order to conserve file descriptors.
Value: 0 (default), or 1.
read_buffer_docs = 1M
Buffer size for storing the list of documents per keyword. Increasing this value will result in higher memory usage during query execution, but may reduce I/O time.
Value: size, default 256k, minimum value is 8k.
read_buffer_hits = 1M
Buffer size for storing the list of hits per keyword. Increasing this value will result in higher memory usage during query execution, but may reduce I/O time.
Value: size, default 256k, minimum value is 8k.
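For example (illustrative values, not tuned recommendations), a search-heavy plain table in a setup with only a few tables might combine these options as follows:

table products {
type = plain
source = src_products
path = /path/to/products
# keep table files open between queries to avoid per-query open() calls
preopen = 1
# larger read buffers may reduce I/O time at the cost of more memory per query
read_buffer_docs = 1M
read_buffer_hits = 1M
}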
inplace_enable = {0|1}
Enables in-place table inversion. Optional, default is 0 (uses separate temporary files).
The inplace_enable option reduces the disk footprint during indexing of plain tables, while slightly slowing down indexing (it uses approximately 2 times less disk, but yields around 90-95% of the original performance).
Indexing is comprised of two primary phases. During the first phase, documents are collected, processed, and partially sorted by keyword, and the intermediate results are written to temporary files (.tmp*). During the second phase, the documents are fully sorted and the final table files are created. Rebuilding a production table on-the-fly requires approximately 3 times the peak disk footprint: first for the intermediate temporary files, second for the newly constructed copy, and third for the old table that will be serving production queries in the meantime. (Intermediate data is comparable in size to the final table.) This may be too much disk footprint for large data collections, and the inplace_enable option can be used to reduce it. When enabled, it reuses the temporary files, outputs the final data back to them, and renames them upon completion. However, this may require additional temporary data chunk relocation, which is where the performance impact comes from.
This directive has no effect on searchd; it only affects the indexer.
table products {
inplace_enable = 1
path = products
source = src_base
}
inplace_hit_gap = size
An in-place inversion fine-tuning option that controls the preallocated hitlist gap size. Optional, default is 0.
This directive only affects the indexer tool and has no impact on the searchd server.
table products {
inplace_hit_gap = 1M
inplace_enable = 1
path = products
source = src_base
}
inplace_reloc_factor = 0.1
The inplace_reloc_factor setting determines the size of the relocation buffer within the memory arena used during indexing. The default value is 0.1.
This option is optional and only affects the indexer tool, not the searchd server.
table products {
inplace_reloc_factor = 0.1
inplace_enable = 1
path = products
source = src_base
}
inplace_write_factor = 0.1
Controls the size of the buffer used for in-place writing during indexing. Optional, with a default value of 0.1.
It's important to note that this directive only impacts the indexer tool and not the searchd server.
table products {
inplace_write_factor = 0.1
inplace_enable = 1
path = products
source = src_base
}
The following settings are supported. They are all described in section NLP and tokenization.
A percolate table is a special table that stores queries rather than documents. It is used for prospective searches, or "search in reverse."
The schema of a percolate table is fixed and contains the following fields:
| Field | Description |
|---|---|
| ID | An unsigned 64-bit integer with auto-increment functionality. It can be omitted when adding a PQ rule, as described in add a PQ rule |
| Query | Full-text query of the rule, which can be thought of as the value of MATCH clause or JSON /search. If per field operators are used inside the query, the full-text fields need to be declared in the percolate table configuration. If the stored query is only for attribute filtering (without full-text querying), the query value can be empty or omitted. The value of this field should correspond to the expected document schema, which is specified when creating the percolate table. |
| Filters | Optional. Filters are an optional string containing attribute filters and/or expressions, defined the same way as in the WHERE clause or JSON filtering. The value of this field should correspond to the expected document schema, which is specified when creating the percolate table. |
| Tags | Optional. Tags represent a list of string labels separated by commas that can be used for filtering/deleting PQ rules. The tags can also be returned along with matching documents when performing a Percolate query |
Note that you do not need to add the above fields when creating a percolate table.
What you need to keep in mind when creating a new percolate table is to specify the expected schema of a document, which will be checked against the rules you will add later. This is done in the same way as for any other local table.
CREATE TABLE products(title text, meta json) type='pq';
Query OK, 0 rows affected (0.00 sec)
POST /cli -d "CREATE TABLE products(title text, meta json) type='pq'"
{
"total":0,
"error":"",
"warning":""
}
$index = [
'index' => 'products',
'body' => [
'columns' => [
'title' => ['type' => 'text'],
'meta' => ['type' => 'json']
],
'settings' => [
'type' => 'pq'
]
]
];
$client->indices()->create($index);
Array(
[total] => 0
[error] =>
[warning] =>
)
utilsApi.sql('CREATE TABLE products(title text, meta json) type=\'pq\'')
res = await utilsApi.sql('CREATE TABLE products(title text, meta json) type=\'pq\'');
utilsApi.sql("CREATE TABLE products(title text, meta json) type='pq'");
utilsApi.Sql("CREATE TABLE products(title text, meta json) type='pq'");
res = await utilsApi.sql("CREATE TABLE products(title text, meta json) type='pq'");
apiClient.UtilsAPI.Sql(context.Background()).Body("CREATE TABLE products(title text, meta json) type='pq'").Execute()
table products {
type = percolate
path = tbl_pq
rt_field = title
rt_attr_json = meta
}
A Template Table is a special type of table in Manticore that doesn't store any data and doesn't create any files on your disk. Despite this, it can have the same NLP settings as a plain or real-time table. Template tables can be used for the following purposes:
table template {
type = template
morphology = stem_en
wordforms = wordforms.txt
exceptions = exceptions.txt
stopwords = stopwords.txt
}
Manticore doesn't store text as-is for performing full-text searching on it. Instead, it extracts words and creates several structures that allow fast full-text searching. From the found words, a dictionary is built, which allows a quick lookup to discover whether a word is present in the index. In addition, other structures record the documents and fields in which the word was found, as well as its position inside a field. All of these are used when a full-text match is performed.
The process of demarcating and classifying words is called tokenization. The tokenization is applied at both indexing and searching, and it operates at the character and word level.
On the character level, the engine allows only certain characters to pass. This is defined by the charset_table. Anything else is replaced with a whitespace (which is considered the default word separator). The charset_table also allows mappings, such as lowercasing or simply replacing one character with another. Besides that, characters can be ignored, blended, or defined as a phrase boundary.
At the word level, the base setting is min_word_len, which defines the minimum word length in characters to be accepted in the index. A common request is to match singular with plural forms of words. For this, morphology processors can be used.
Going further, we might want a word to be matched as another one because they are synonyms. For this, the word forms feature can be used, which allows one or more words to be mapped to another one.
Very common words can have unwanted effects on searching, mostly because, due to their frequency, they require a lot of computation to process their doc/hit lists. They can be blacklisted with the stop words functionality. This helps not only in speeding up queries but also in decreasing the index size.
A more advanced form of blacklisting is bigrams, which creates a special token between a "bigram" (common) word and an uncommon word. This can speed up phrase searches several times when common words are involved.
When indexing HTML content, it's important not to index the HTML tags, as they can introduce a lot of "noise" into the index. HTML stripping can be used and configured to strip tags while still indexing certain tag attributes, or to completely ignore the content of certain HTML elements.
Manticore supports a wide range of languages, with basic support enabled for most languages via charset_table = non_cjk (which is the default value).
For many languages, Manticore provides a stopwords file that can be used to improve search relevance.
Additionally, advanced morphology is available for a few languages that can significantly improve search relevance by using dictionary-based lemmatization or stemming algorithms for better segmentation and normalization.
The table below lists all supported languages and indicates how to enable:
| Language | Supported | Stopwords file name | Advanced morphology | Notes |
|---|---|---|---|---|
| Afrikaans | charset_table=non_cjk | af | - | |
| Arabic | charset_table=non_cjk | ar | morphology=stem_ar (Arabic stemmer); morphology=libstemmer_ar | |
| Armenian | charset_table=non_cjk | hy | - | |
| Assamese | specify charset_table manually | - | - | |
| Basque | charset_table=non_cjk | eu | - | |
| Bengali | charset_table=non_cjk | bn | - | |
| Bishnupriya | specify charset_table manually | - | - | |
| Buhid | specify charset_table manually | - | - | |
| Bulgarian | charset_table=non_cjk | bg | - | |
| Catalan | charset_table=non_cjk | ca | morphology=libstemmer_ca | |
| Chinese | charset_table=chinese or ngram_chars=chinese | zh | morphology=icu_chinese or ngram_len=1 correspondingly | ICU dictionary based segmentation is much more accurate than ngram-based |
| Croatian | charset_table=non_cjk | hr | - | |
| Kurdish | charset_table=non_cjk | ckb | - | |
| Czech | charset_table=non_cjk | cz | morphology=stem_cz (Czech stemmer) | |
| Danish | charset_table=non_cjk | da | morphology=libstemmer_da | |
| Dutch | charset_table=non_cjk | nl | morphology=libstemmer_nl | |
| English | charset_table=non_cjk | en | morphology=lemmatize_en (single root form); morphology=lemmatize_en_all (all root forms); morphology=stem_en (Porter's English stemmer); morphology=stem_enru (Porter's English and Russian stemmers); morphology=libstemmer_en (English from libstemmer) | |
| Esperanto | charset_table=non_cjk | eo | - | |
| Estonian | charset_table=non_cjk | et | - | |
| Finnish | charset_table=non_cjk | fi | morphology=libstemmer_fi | |
| French | charset_table=non_cjk | fr | morphology=libstemmer_fr | |
| Galician | charset_table=non_cjk | gl | - | |
| Garo | specify charset_table manually | - | - | |
| German | charset_table=non_cjk | de | morphology=lemmatize_de (single root form); morphology=lemmatize_de_all (all root forms); morphology=libstemmer_de | |
| Greek | charset_table=non_cjk | el | morphology=libstemmer_el | |
| Hebrew | charset_table=non_cjk | he | - | |
| Hindi | charset_table=non_cjk | hi | morphology=libstemmer_hi | |
| Hmong | specify charset_table manually | - | - | |
| Ho | specify charset_table manually | - | - | |
| Hungarian | charset_table=non_cjk | hu | morphology=libstemmer_hu | |
| Indonesian | charset_table=non_cjk | id | morphology=libstemmer_id | |
| Irish | charset_table=non_cjk | ga | morphology=libstemmer_ga | |
| Italian | charset_table=non_cjk | it | morphology=libstemmer_it | |
| Japanese | ngram_chars=japanese | - | ngram_chars=japanese ngram_len=1 | Requires ngram-based segmentation |
| Komi | specify charset_table manually | - | - | |
| Korean | ngram_chars=korean | - | ngram_chars=korean ngram_len=1 | Requires ngram-based segmentation |
| Large Flowery Miao | specify charset_table manually | - | - | |
| Latin | charset_table=non_cjk | la | - | |
| Latvian | charset_table=non_cjk | lv | - | |
| Lithuanian | charset_table=non_cjk | lt | morphology=libstemmer_lt | |
| Maba | specify charset_table manually | - | - | |
| Maithili | specify charset_table manually | - | - | |
| Marathi | specify charset_table manually | - | - | |
| Marathi | charset_table=non_cjk | mr | - | |
| Mende | specify charset_table manually | - | - | |
| Mru | specify charset_table manually | - | - | |
| Myene | specify charset_table manually | - | - | |
| Nepali | specify charset_table manually | - | morphology=libstemmer_ne | |
| Ngambay | specify charset_table manually | - | - | |
| Norwegian | charset_table=non_cjk | no | morphology=libstemmer_no | |
| Odia | specify charset_table manually | - | - | |
| Persian | charset_table=non_cjk | fa | - | |
| Polish | charset_table=non_cjk | pl | - | |
| Portuguese | charset_table=non_cjk | pt | morphology=libstemmer_pt | |
| Romanian | charset_table=non_cjk | ro | morphology=libstemmer_ro | |
| Russian | charset_table=non_cjk | ru | morphology=lemmatize_ru (single root form); morphology=lemmatize_ru_all (all root forms); morphology=stem_ru (Porter's Russian stemmer); morphology=stem_enru (Porter's English and Russian stemmers); morphology=libstemmer_ru (from libstemmer) | |
| Santali | specify charset_table manually | - | - | |
| Sindhi | specify charset_table manually | - | - | |
| Slovak | charset_table=non_cjk | sk | - | |
| Slovenian | charset_table=non_cjk | sl | - | |
| Somali | charset_table=non_cjk | so | - | |
| Sotho | charset_table=non_cjk | st | - | |
| Spanish | charset_table=non_cjk | es | morphology=libstemmer_es | |
| Swahili | charset_table=non_cjk | sw | - | |
| Swedish | charset_table=non_cjk | sv | morphology=libstemmer_sv | |
| Sylheti | specify charset_table manually | - | - | |
| Tamil | specify charset_table manually | - | morphology=libstemmer_ta | |
| Thai | charset_table=non_cjk | th | - | |
| Turkish | charset_table=non_cjk | tr | morphology=libstemmer_tr | |
| Ukrainian | charset_table=non_cjk,U+0406->U+0456,U+0456,U+0407->U+0457,U+0457,U+0490->U+0491,U+0491 | - | morphology=lemmatize_uk_all | Requires installation of UK lemmatizer |
| Yoruba | charset_table=non_cjk | yo | - | |
| Zulu | charset_table=non_cjk | zu | - | |
Manticore provides built-in support for indexing CJK texts, allowing you to process CJK texts in two different ways:
- precise, dictionary-based segmentation using the ICU library (currently for Chinese, via morphology = icu_chinese)
- N-gram based segmentation (via ngram_len and ngram_chars)
CREATE TABLE products(title text, price float) charset_table = 'cjk' morphology = 'icu_chinese'
POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'cjk' morphology = 'icu_chinese'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'charset_table' => 'cjk',
'morphology' => 'icu_chinese'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cjk\' morphology = \'icu_chinese\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cjk\' morphology = \'icu_chinese\'');
utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'cjk' morphology = 'icu_chinese'");
utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'cjk' morphology = 'icu_chinese'");
table products {
charset_table = cjk
morphology = icu_chinese
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
There are also separate character set tables (chinese, korean, japanese) that can be used, or you can use the common cjk character set table.
CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'
POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'charset_table' => 'non_cjk',
'ngram_len' => '1',
'ngram_chars' => 'cjk'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'non_cjk\' ngram_len = \'1\' ngram_chars = \'cjk\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'non_cjk\' ngram_len = \'1\' ngram_chars = \'cjk\'');
utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'");
utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'");
table products {
charset_table = non_cjk
ngram_len = 1
ngram_chars = cjk
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
Additionally, there is built-in support for Chinese stopwords with the alias zh.
CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'
POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'charset_table' => 'chinese',
'morphology' => 'icu_chinese',
'stopwords' => 'zh'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'chinese\' morphology = \'icu_chinese\' stopwords = \'zh\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'chinese\' morphology = \'icu_chinese\' stopwords = \'zh\'');
utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'");
utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'");
table products {
charset_table = chinese
morphology = icu_chinese
stopwords = zh
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
When text is indexed in Manticore, it is split into words and case folding is done so that words like "Abc", "ABC", and "abc" are treated as the same word.
To perform these operations correctly, Manticore must know:
You can configure these settings on a per-table basis using the charset_table option. charset_table specifies an array that maps letter characters to their case-folded versions (or any other characters that you prefer). Characters that are not present in the array are considered to be non-letters and will be treated as word separators during indexing or searching in this table.
The default character set is non_cjk, which includes most languages.
You can also define text pattern replacement rules. For example, with the following rules:
regexp_filter = \**(\d+)\" => \1 inch
regexp_filter = (BLUE|RED) => COLOR
The text RED TUBE 5" LONG would be indexed as COLOR TUBE 5 INCH LONG, and PLANK 2" x 4" would be indexed as PLANK 2 INCH x 4 INCH. These rules are applied in the specified order. The rules also apply to queries, so a search for BLUE TUBE would actually search for COLOR TUBE.
You can learn more about regexp_filter here.
# default
charset_table = non_cjk
# only English and Russian letters
charset_table = 0..9, A..Z->a..z, _, a..z, \
U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451
# english charset defined with alias
charset_table = 0..9, english, _
# you can override character mappings by redefining them, e.g. for case insensitive search with German umlauts you can use:
charset_table = non_cjk, U+00E4, U+00C4->U+00E4, U+00F6, U+00D6->U+00F6, U+00FC, U+00DC->U+00FC, U+00DF, U+1E9E->U+00DF
charset_table specifies an array that maps letter characters to their case folded versions (or any other characters if you like). The default character set is non_cjk which includes most non-CJK languages.
charset_table is a workhorse of Manticore's tokenization process, which extracts keywords from document text or query text. It controls what characters are accepted as valid and how they should be transformed (e.g. whether case should be removed or not).
By default, every character maps to 0, which means that it is not considered a valid keyword and is treated as a separator. Once a character is mentioned in the table, it is mapped to another character (most frequently, either to itself or to a lowercase letter) and is treated as a valid keyword part.
charset_table uses a comma-separated list of mappings to declare characters as valid or to map them to other characters. Syntax shortcuts are available for mapping ranges of characters at once:
- A->a. Declares the source character 'A' as allowed within keywords and maps it to the destination character 'a' (but does not declare 'a' as allowed).
- A..Z->a..z. Declares all characters in the source range as allowed and maps them to the destination range. Does not declare the destination range as allowed. Checks the lengths of both ranges.
- a. Declares a character as allowed and maps it to itself. Equivalent to the a->a single char mapping.
- a..z. Declares all characters in the range as allowed and maps them to themselves. Equivalent to the a..z->a..z range mapping.
- A..Z/2. Maps every pair of characters to the second character. For instance, A..Z/2 is equivalent to A->B, B->B, C->D, D->D, ..., Y->Z, Z->Z. This mapping shortcut is helpful for Unicode blocks where uppercase and lowercase letters go in an interleaved order.

Characters with codes from 0 to 32 are always treated by Manticore as separators. To avoid configuration file encoding issues, characters with codes above 127 (8-bit ASCII and Unicode characters) must be specified in U+XXX form, where XXX is a hexadecimal code point number. The minimal accepted Unicode character code is U+0021.
If the default mappings are insufficient for your needs, you can redefine the character mappings by specifying them again with another mapping. For example, if the built-in non_cjk array includes characters Ä and ä and maps them both to the ASCII character a, you can redefine those characters by adding the Unicode code points for them, like this:
charset_table = non_cjk,U+00E4,U+00C4
for case sensitive search or
charset_table = non_cjk,U+00E4,U+00C4->U+00E4
for case insensitive search.
CREATE TABLE products(title text, price float) charset_table = '0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451'
POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = '0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'charset_table' => '0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451\'');
utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = '0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451'");
utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = '0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451'");
table products {
charset_table = 0..9, A..Z->a..z, _, a..z, \
U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
Besides definitions of characters and mappings, there are several built-in aliases that can be used. Current aliases are:
- english
- russian
- non_cjk
- cjk

CREATE TABLE products(title text, price float) charset_table = '0..9, english, _'
POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = '0..9, english, _'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'charset_table' => '0..9, english, _'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'0..9, english, _\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'0..9, english, _\'');
utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = '0..9, english, _'");
utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = '0..9, english, _'");
table products {
charset_table = 0..9, english, _
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
If you want to support different languages in your search, it can be a laborious task to define sets of valid characters and folding rules for all of them. We have simplified this for you by providing default charset tables, non_cjk and cjk, that cover non-CJK and CJK (Chinese, Japanese, Korean) languages respectively. In most cases, these charsets should be sufficient for your needs.
Please note that the following languages are currently not supported:
All other languages listed in the Unicode languages list are supported by default.
To work with both cjk and non-cjk languages, set the options in your configuration file as shown below (with an exception for Chinese):
CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'
POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'charset_table' => 'non_cjk',
'ngram_len' => '1',
'ngram_chars' => 'cjk'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'non_cjk\' ngram_len = \'1\' ngram_chars = \'cjk\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'non_cjk\' ngram_len = \'1\' ngram_chars = \'cjk\'');
utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'");
utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'");
table products {
charset_table = non_cjk
ngram_len = 1
ngram_chars = cjk
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
If you do not require support for CJK languages, you can simply exclude the ngram_len and ngram_chars options. For more information on these options, refer to the corresponding documentation sections.
To map one character to multiple characters or vice versa, regexp_filter can be helpful.
blend_chars = +, &, U+23
blend_chars = +, &->+
Blended characters list. Optional, default is empty.
Blended characters are indexed as both separators and valid characters. For example, when & is defined as a blended character and AT&T appears in an indexed document, three different keywords will be indexed, at&t, at and t.
Additionally, blended characters can influence indexing in such a way that keywords are indexed as if the blended characters were not typed at all. This behavior is particularly evident when blend_mode = trim_all is specified. For example, the phrase some_thing will be indexed as some, something, and thing with blend_mode = trim_all.
Care should be taken when using blended characters as defining a character as blended means that it is no longer a separator.
For example, if you add a comma to blend_chars and search for dog,cat, Manticore will treat that as a single token dog,cat. If dog,cat was not indexed as dog,cat, but left as dog cat only, then it will not match.

Positions for tokens obtained by replacing blended characters with whitespace are assigned as usual, and regular keywords will be indexed as if there were no blend_chars specified at all. An additional token that mixes blended and non-blended characters will be put at the starting position. For instance, if AT&T company occurs in the very beginning of the text field, at will be given position 1, t position 2, company position 3, and AT&T will also be given position 1, blending with the opening regular keyword. As a result, queries for AT&T or just AT will match that document. A phrase query for "AT T" will also match, as well as a phrase query for "AT&T company".
Blended characters can overlap with special characters used in query syntax, such as T-Mobile or @twitter. Where possible, the query parser will handle the blended character as blended. For instance, if hello @twitter is within quotes (a phrase operator), the query parser will handle the @ symbol as blended. However, if the @ symbol was not within quotes, the character would be handled as an operator. Therefore, it is recommended to escape keywords.
Blended characters can be remapped so that multiple different blended characters can be normalized into one base form. This is useful when indexing multiple alternative Unicode codepoints with equivalent glyphs.
CREATE TABLE products(title text, price float) blend_chars = '+, &, U+23, @->_'
POST /cli -d "
CREATE TABLE products(title text, price float) blend_chars = '+, &, U+23, @->_'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'blend_chars' => '+, &, U+23, @->_'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) blend_chars = \'+, &, U+23, @->_\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) blend_chars = \'+, &, U+23, @->_\'');
utilsApi.sql("CREATE TABLE products(title text, price float) blend_chars = '+, &, U+23, @->_'");
utilsApi.Sql("CREATE TABLE products(title text, price float) blend_chars = '+, &, U+23, @->_'");
table products {
blend_chars = +, &, U+23, @->_
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
blend_mode = option [, option [, ...]]
option = trim_none | trim_head | trim_tail | trim_both | trim_all | skip_pure
The blended tokens indexing mode is enabled by the blend_mode directive.
By default, tokens that mix blended and non-blended characters get indexed entirely. For example, when both an at-sign and an exclamation are in blend_chars, the string @dude! will be indexed as two tokens: @dude! (with all the blended characters) and dude (without any). As a result, a query of @dude will not match it.
blend_mode adds flexibility to this indexing behavior. It takes a comma-separated list of options, each of which specifies a token indexing variant.
If multiple options are specified, multiple variants of the same token will be indexed. Regular keywords (resulting from that token by replacing blended characters with a separator) are always indexed.
The options are:
- trim_none - Index the entire token
- trim_head - Trim heading blended characters, and index the resulting token
- trim_tail - Trim trailing blended characters, and index the resulting token
- trim_both - Trim both heading and trailing blended characters, and index the resulting token
- trim_all - Trim heading, trailing, and middle blended characters, and index the resulting token
- skip_pure - Do not index the token if it is purely blended, that is, consists of blended characters only

Using blend_mode with the example @dude! string above, the setting blend_mode = trim_head, trim_tail would result in two indexed tokens: @dude and dude!. Using trim_both would have no effect because trimming both blended characters results in dude, which is already indexed as a regular keyword. Indexing @U.S.A. with trim_both (and assuming that dot is blended too) would result in U.S.A being indexed. Lastly, skip_pure enables you to ignore sequences of blended characters only. For example, one @@@ two would be indexed as one two, and match that as a phrase. This is not the case by default because a fully blended token gets indexed and offsets the second keyword position.
Default behavior is to index the entire token, equivalent to blend_mode = trim_none.
Be aware that using blend modes limits your search, even with the default mode trim_none (assuming . is a blended character):
- .dog. will become .dog. dog during indexing
- and you won't be able to find it by searching for dog..

Using more modes increases the chance your keyword will match something.
CREATE TABLE products(title text, price float) blend_mode = 'trim_tail, skip_pure' blend_chars = '+, &'
POST /cli -d "
CREATE TABLE products(title text, price float) blend_mode = 'trim_tail, skip_pure' blend_chars = '+, &'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'blend_mode' => 'trim_tail, skip_pure',
'blend_chars' => '+, &'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) blend_mode = \'trim_tail, skip_pure\' blend_chars = \'+, &\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) blend_mode = \'trim_tail, skip_pure\' blend_chars = \'+, &\'');
utilsApi.sql("CREATE TABLE products(title text, price float) blend_mode = 'trim_tail, skip_pure' blend_chars = '+, &'");
utilsApi.Sql("CREATE TABLE products(title text, price float) blend_mode = 'trim_tail, skip_pure' blend_chars = '+, &'");
table products {
blend_mode = trim_tail, skip_pure
blend_chars = +, &
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
min_word_len = length
min_word_len is an optional index configuration option in Manticore that specifies the minimum indexed word length. The default value is 1, which means that everything is indexed.
Only those words that are not shorter than this minimum will be indexed. For example, if min_word_len is 4, then 'the' won't be indexed, but 'they' will be.
CREATE TABLE products(title text, price float) min_word_len = '4'
POST /cli -d "
CREATE TABLE products(title text, price float) min_word_len = '4'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'min_word_len' => '4'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) min_word_len = \'4\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) min_word_len = \'4\'');
utilsApi.sql("CREATE TABLE products(title text, price float) min_word_len = '4'");
utilsApi.Sql("CREATE TABLE products(title text, price float) min_word_len = '4'");
table products {
min_word_len = 4
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
ngram_len = 1
N-gram lengths for N-gram indexing. Optional, default is 0 (disable n-gram indexing). Known values are 0 and 1.
N-grams provide basic CJK (Chinese, Japanese, Korean) support for unsegmented texts. The issue with CJK searching is that there may be no clear separators between the words. In some cases, you may not want to use dictionary-based segmentation, such as the one available for Chinese. In those cases, n-gram segmentation might work well too.
When this feature is enabled, streams of CJK (or any other characters defined in ngram_chars) are indexed as N-grams. For example, if the incoming text is "ABCDEF" (where A to F represent some CJK characters) and ngram_len is 1, it will be indexed as if it were "A B C D E F". Only ngram_len=1 is currently supported. Only those characters that are listed in ngram_chars table will be split this way; others will not be affected.
Note that if the search query is segmented, i.e. there are separators between individual words, then wrapping the words in quotes and using extended mode will result in proper matches being found even if the text was not segmented. For instance, assume that the original query is BC DEF. After wrapping in quotes on the application side, it should look like "BC" "DEF" (with quotes). This query will be passed to Manticore and internally split into 1-grams too, resulting in "B C" "D E F" query, still with quotes that are the phrase matching operator. And it will match the text even though there were no separators in the text.
Even if the search query is not segmented, Manticore should still produce good results, thanks to phrase-based ranking: it will pull closer phrase matches (which in the case of N-gram CJK words can mean closer multi-character word matches) to the top.
CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'
POST /cli -d "
CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'ngram_chars' => 'cjk',
'ngram_len' => '1'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) ngram_chars = \'cjk\' ngram_len = \'1\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) ngram_chars = \'cjk\' ngram_len = \'1\'');
utilsApi.sql("CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'");
utilsApi.Sql("CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'");
table products {
ngram_chars = cjk
ngram_len = 1
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
ngram_chars = cjk
ngram_chars = cjk, U+3000..U+2FA1F
N-gram characters list. Optional, default is empty.
To be used in conjunction with ngram_len, this list defines characters, sequences of which are subject to N-gram extraction. Words comprised of other characters will not be affected by the N-gram indexing feature. The value format is identical to charset_table. N-gram characters cannot appear in the charset_table.
CREATE TABLE products(title text, price float) ngram_chars = 'U+3000..U+2FA1F' ngram_len = '1'
POST /cli -d "
CREATE TABLE products(title text, price float) ngram_chars = 'U+3000..U+2FA1F' ngram_len = '1'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'ngram_chars' => 'U+3000..U+2FA1F',
'ngram_len' => '1'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) ngram_chars = \'U+3000..U+2FA1F\' ngram_len = \'1\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) ngram_chars = \'U+3000..U+2FA1F\' ngram_len = \'1\'');
utilsApi.sql("CREATE TABLE products(title text, price float) ngram_chars = 'U+3000..U+2FA1F' ngram_len = '1'");
utilsApi.Sql("CREATE TABLE products(title text, price float) ngram_chars = 'U+3000..U+2FA1F' ngram_len = '1'");
table products {
ngram_chars = U+3000..U+2FA1F
ngram_len = 1
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
You can also use the alias for our default N-gram table, as in the example below. It should be sufficient in most cases.
CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'
POST /cli -d "
CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'ngram_chars' => 'cjk',
'ngram_len' => '1'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) ngram_chars = \'cjk\' ngram_len = \'1\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) ngram_chars = \'cjk\' ngram_len = \'1\'');
utilsApi.sql("CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'");
utilsApi.Sql("CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'");
table products {
ngram_chars = cjk
ngram_len = 1
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
ignore_chars = U+AD
Ignored characters list. Optional, default is empty.
Useful in cases when some characters, such as soft hyphenation mark (U+00AD), should be not just treated as separators but rather fully ignored. For example, if '-' is simply not in the charset_table, "abc-def" text will be indexed as "abc" and "def" keywords. On the contrary, if '-' is added to ignore_chars list, the same text will be indexed as a single "abcdef" keyword.
The syntax is the same as for charset_table, but it's only allowed to declare characters, and not allowed to map them. Also, the ignored characters must not be present in charset_table.
CREATE TABLE products(title text, price float) ignore_chars = 'U+AD'
POST /cli -d "
CREATE TABLE products(title text, price float) ignore_chars = 'U+AD'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'ignore_chars' => 'U+AD'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) ignore_chars = \'U+AD\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) ignore_chars = \'U+AD\'');
utilsApi.sql("CREATE TABLE products(title text, price float) ignore_chars = 'U+AD'");
utilsApi.Sql("CREATE TABLE products(title text, price float) ignore_chars = 'U+AD'");
table products {
ignore_chars = U+AD
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
bigram_index = {none|all|first_freq|both_freq}
Bigram indexing mode. Optional, default is none.
Bigram indexing is a feature to accelerate phrase searches. When indexing, it stores a document list for either all or some of the adjacent words pairs into the index. Such a list can then be used at searching time to significantly accelerate phrase or sub-phrase matching.
bigram_index controls the selection of specific word pairs. The known modes are:
- all - index every single word pair
- first_freq - only index word pairs where the first word is in a list of frequent words (see bigram_freq_words). For example, with bigram_freq_words = the, in, i, a, indexing the text "alone in the dark" will result in the "in the" and "the dark" pairs being stored as bigrams, because they begin with a frequent keyword ("in" or "the" respectively), but "alone in" will not be indexed, because "in" is the second word in that pair.
- both_freq - only index word pairs where both words are frequent. Continuing with the same example, in this mode indexing "alone in the dark" would only store "in the" (the very worst of them all from a searching perspective) as a bigram, but none of the other word pairs.

For most use cases, both_freq would be the best mode, but your mileage may vary.
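Since bigrams accelerate phrase matching, a typical query that benefits from this setting is an ordinary phrase search, for instance (illustrative):
SELECT * FROM products WHERE MATCH('"alone in the dark"');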
CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'both_freq'
POST /cli -d "
CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'both_freq'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'bigram_freq_words' => 'the, a, you, i',
'bigram_index' => 'both_freq'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) bigram_freq_words = \'the, a, you, i\' bigram_index = \'both_freq\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) bigram_freq_words = \'the, a, you, i\' bigram_index = \'both_freq\'');
utilsApi.sql("CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'both_freq'");
utilsApi.Sql("CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'both_freq'");
table products {
bigram_index = both_freq
bigram_freq_words = the, a, you, i
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
bigram_freq_words = the, a, you, i
A list of keywords considered "frequent" when indexing bigrams. Optional, default is empty.
Some of the bigram indexing modes (see bigram_index) require to define a list of frequent keywords. These are not to be confused with stop words. Stop words are completely eliminated when both indexing and searching. Frequent keywords are only used by bigrams to determine whether to index a current word pair or not.
bigram_freq_words lets you define a list of such keywords.
CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'first_freq'
POST /cli -d "
CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'first_freq'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'bigram_freq_words' => 'the, a, you, i',
'bigram_index' => 'first_freq'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) bigram_freq_words = \'the, a, you, i\' bigram_index = \'first_freq\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) bigram_freq_words = \'the, a, you, i\' bigram_index = \'first_freq\'');
utilsApi.sql("CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'first_freq'");
utilsApi.Sql("CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'first_freq'");
table products {
bigram_freq_words = the, a, you, i
bigram_index = first_freq
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
dict = {keywords|crc}
The type of keywords dictionary used is identified by one of two known values, 'crc' or 'keywords'. This is optional, with 'keywords' as the default.
Using the keywords dictionary mode (dict=keywords) can significantly decrease the indexing burden and enable substring searches on extensive collections. This mode can be utilized for both plain and RT tables.
CRC dictionaries do not store the original keyword text in the index. Instead, they replace keywords with a control sum value (computed using FNV64) during both searching and indexing processes. This value is used internally within the index. This approach has two disadvantages:
The keywords dictionary resolves both of these issues. It stores keywords in the index and performs search-time wildcard expansion. For instance, a search for a test* prefix could internally expand to a 'test|tests|testing' query based on the dictionary's contents. This expansion process is entirely invisible to the application, with the exception that the separate per-keyword statistics for all the matched keywords are now also reported.
For substring (infix) searches, extended wildcards can be used. Special characters such as ? and % are compatible with substring (infix) search (e.g., t?st*, run%, *abc*). Note that the wildcard operators and REGEX work only with dict=keywords.
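For illustration, assuming a table with dict = 'keywords' and substring indexing enabled (see min_infix_len), wildcard and single-character patterns can be used directly in MATCH():
SELECT * FROM products WHERE MATCH('test*');
SELECT * FROM products WHERE MATCH('t?st*');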
Indexing with a keywords dictionary is approximately 1.1x to 1.3x slower than regular, non-substring indexing - yet significantly faster than substring indexing (either prefix or infix). The index size should only be slightly larger than that of the standard non-substring table, with a total difference of 1-10%. The time it takes for regular keyword searching should be nearly the same or identical across all three index types discussed (CRC non-substring, CRC substring, keywords). Substring searching time can significantly fluctuate based on how many actual keywords match the given substring (i.e., how many keywords the search term expands into). The maximum number of matched keywords is limited by the expansion_limit directive.
In summary, keywords and CRC dictionaries offer two different trade-off decisions for substring searching. You can opt to either sacrifice indexing time and index size to achieve the fastest worst-case searches (CRC dictionary), or minimally impact indexing time but sacrifice worst-case searching time when the prefix expands into a high number of keywords (keywords dictionary).
CREATE TABLE products(title text, price float) dict = 'keywords'
POST /cli -d "
CREATE TABLE products(title text, price float) dict = 'keywords'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'dict' => 'keywords'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) dict = \'keywords\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) dict = \'keywords\'');
utilsApi.sql("CREATE TABLE products(title text, price float) dict = 'keywords'");
utilsApi.Sql("CREATE TABLE products(title text, price float) dict = 'keywords'");
table products {
dict = keywords
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
embedded_limit = size
Embedded exceptions, wordforms, or stop words file size limit. Optional, default is 16K.
When you create a table, the above-mentioned files can be either stored externally along with the table or embedded directly into it. Files sized under embedded_limit get stored in the table. For bigger files, only the file names are stored. This also simplifies moving table files to a different machine; you may get by with just copying a single file.
With smaller files, such embedding reduces the number of the external files on which the table depends, and helps maintenance. But at the same time it makes no sense to embed a 100 MB wordforms dictionary into a tiny delta table. So there needs to be a size threshold, and embedded_limit is that threshold.
table products {
embedded_limit = 32K
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
global_idf = /path/to/global.idf
The path to a file with global (cluster-wide) keyword IDFs. Optional, default is empty (use local IDFs).
On a multi-table cluster, per-keyword frequencies are quite likely to differ across different tables. That means that when the ranking function uses TF-IDF based values, such as BM25 family of factors, the results might be ranked slightly differently depending on what cluster node they reside.
The easiest way to fix that issue is to create and utilize a global frequency dictionary, or a global IDF file for short. This directive lets you specify the location of that file. It is suggested (but not required) to use an .idf extension. When the IDF file is specified for a given table and OPTION global_idf is set to 1, the engine will use the keyword frequencies and collection document counts from the global_idf file, rather than just the local table. That way, IDFs and the values that depend on them will stay consistent across the cluster.
IDF files can be shared across multiple tables. Only a single copy of an IDF file will be loaded by searchd, even when many tables refer to that file. Should the contents of an IDF file change, the new contents can be loaded with a SIGHUP.
You can build an .idf file using the indextool utility: first dump dictionaries using the --dumpdict dict.txt --stats switch, then convert those to .idf format using --buildidf, and finally merge all the .idf files across the cluster using --mergeidf.
CREATE TABLE products(title text, price float) global_idf = '/usr/local/manticore/var/global.idf'
POST /cli -d "
CREATE TABLE products(title text, price float) global_idf = '/usr/local/manticore/var/global.idf'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'global_idf' => '/usr/local/manticore/var/global.idf'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) global_idf = \'/usr/local/manticore/var/global.idf\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) global_idf = \'/usr/local/manticore/var/global.idf\'');
utilsApi.sql("CREATE TABLE products(title text, price float) global_idf = '/usr/local/manticore/var/global.idf'");
utilsApi.Sql("CREATE TABLE products(title text, price float) global_idf = '/usr/local/manticore/var/global.idf'");
table products {
global_idf = /usr/local/manticore/var/global.idf
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
hitless_words = {all|path/to/file}
Hitless words list. Optional, allowed values are 'all', or a list file name.
By default, the Manticore full-text index stores not only a list of matching documents for every given keyword, but also a list of its in-document positions (known as a hitlist). Hitlists enable phrase, proximity, strict order and other advanced types of searching, as well as phrase proximity ranking. However, hitlists for specific frequent keywords (which cannot be stopped for some reason despite being frequent) can get huge and thus slow to process while querying. Also, in some cases we might only care about boolean keyword matching and never need position-based searching operators (such as phrase matching) nor phrase ranking.
hitless_words lets you create indexes that either do not have positional information (hitlists) at all, or skip it for specific keywords.
A hitless index will generally use less space than the respective regular full-text index (about 1.5x smaller can be expected). Both indexing and searching should be faster, at the cost of missing positional query and ranking support.
If used in positional queries (e.g., phrase queries), the hitless words are taken out of them and used as operands without a position. For example, if "hello" and "world" are hitless and "simon" and "says" are not, the phrase query "simon says hello world" will be converted to ("simon says" & hello & world), matching "hello" and "world" anywhere in the document and "simon says" as an exact phrase.
A positional query that contains only hitless words will result in an empty phrase node; therefore, the entire query will return an empty result and a warning. If the whole dictionary is hitless (using all), only boolean matching can be used on the respective index.
CREATE TABLE products(title text, price float) hitless_words = 'all'
POST /cli -d "
CREATE TABLE products(title text, price float) hitless_words = 'all'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'hitless_words' => 'all'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) hitless_words = \'all\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) hitless_words = \'all\'');
utilsApi.sql("CREATE TABLE products(title text, price float) hitless_words = 'all'");
utilsApi.Sql("CREATE TABLE products(title text, price float) hitless_words = 'all'");
table products {
hitless_words = all
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
index_field_lengths = {0|1}
Enables computing and storing of field lengths (both per-document and average per-index values) into the full-text index. Optional, default is 0 (do not compute and store).
When index_field_lengths is set to 1 Manticore will:
- create a respective length attribute for every full-text field, with the same name but a __len suffix
- compute average field lengths for the table

BM25A() and BM25F() functions in the expression ranker are based on these lengths and require index_field_lengths to be enabled. Historically, Manticore used a simplified, stripped-down variant of BM25 that, unlike the complete function, did not account for document length. There is also support for both the complete variant of BM25 and its extension towards multiple fields, called BM25F. They require per-document and per-field lengths, respectively; hence the additional directive.
CREATE TABLE products(title text, price float) index_field_lengths = '1'
POST /cli -d "
CREATE TABLE products(title text, price float) index_field_lengths = '1'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'index_field_lengths' => '1'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) index_field_lengths = \'1\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_field_lengths = \'1\'');
utilsApi.sql("CREATE TABLE products(title text, price float) index_field_lengths = '1'");
utilsApi.Sql("CREATE TABLE products(title text, price float) index_field_lengths = '1'");
table products {
index_field_lengths = 1
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
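Once field lengths are stored, the length-aware BM25F() factor can be used in the expression ranker. A minimal sketch, assuming the table was created with index_field_lengths = '1' (the k1/b parameters and the field weight below are illustrative):
SELECT *, weight() FROM products WHERE MATCH('test') OPTION ranker=expr('sum(lcs*user_weight)*1000 + bm25f(1.2, 0.7, {title=3})');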
index_token_filter = my_lib.so:custom_blend:chars=@#&
Index-time token filter for full-text indexing. Optional, default is empty.
The index_token_filter directive specifies an optional index-time token filter for full-text indexing. This directive is used to create a custom tokenizer that makes tokens according to custom rules. The filter is created by the indexer on indexing source data into a plain table or by an RT table on processing INSERT or REPLACE statements. The plugins are defined using the format library name:plugin name:optional string of settings. For example, my_lib.so:custom_blend:chars=@#&.
CREATE TABLE products(title text, price float) index_token_filter = 'my_lib.so:custom_blend:chars=@#&'
POST /cli -d "
CREATE TABLE products(title text, price float) index_token_filter = 'my_lib.so:custom_blend:chars=@#&'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'index_token_filter' => 'my_lib.so:custom_blend:chars=@#&'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) index_token_filter = \'my_lib.so:custom_blend:chars=@#&\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_token_filter = \'my_lib.so:custom_blend:chars=@#&\'');
utilsApi.sql("CREATE TABLE products(title text, price float) index_token_filter = 'my_lib.so:custom_blend:chars=@#&'");
utilsApi.Sql("CREATE TABLE products(title text, price float) index_token_filter = 'my_lib.so:custom_blend:chars=@#&'");
table products {
index_token_filter = my_lib.so:custom_blend:chars=@#&
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
overshort_step = {0|1}
Position increment on overshort (less than min_word_len) keywords. Optional, allowed values are 0 and 1, default is 1.
CREATE TABLE products(title text, price float) overshort_step = '1'
POST /cli -d "
CREATE TABLE products(title text, price float) overshort_step = '1'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'overshort_step' => '1'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) overshort_step = \'1\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) overshort_step = \'1\'');
utilsApi.sql("CREATE TABLE products(title text, price float) overshort_step = '1'");
utilsApi.Sql("CREATE TABLE products(title text, price float) overshort_step = '1'");
table products {
overshort_step = 1
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
phrase_boundary = ., ?, !, U+2026 # horizontal ellipsis
Phrase boundary characters list. Optional, default is empty.
This list controls what characters will be treated as phrase boundaries, in order to adjust word positions and enable phrase-level search emulation through proximity search. The syntax is similar to charset_table, but mappings are not allowed and the boundary characters must not overlap with anything else.
On phrase boundary, additional word position increment (specified by phrase_boundary_step) will be added to current word position. This enables phrase-level searching through proximity queries: words in different phrases will be guaranteed to be more than phrase_boundary_step distance away from each other; so proximity search within that distance will be equivalent to phrase-level search.
A phrase boundary condition will be raised if and only if such a character is followed by a separator; this is to avoid abbreviations such as S.T.A.L.K.E.R or URLs being treated as several phrases.
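For example, with phrase_boundary_step set to a value larger than the proximity distance used in a query, a proximity search effectively becomes a within-one-phrase search (the words below are illustrative):
SELECT * FROM products WHERE MATCH('"leather wallet"~10');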
CREATE TABLE products(title text, price float) phrase_boundary = '., ?, !, U+2026' phrase_boundary_step = '10'
POST /cli -d "
CREATE TABLE products(title text, price float) phrase_boundary = '., ?, !, U+2026' phrase_boundary_step = '10'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'phrase_boundary' => '., ?, !, U+2026',
'phrase_boundary_step' => '10'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) phrase_boundary = \'., ?, !, U+2026\' phrase_boundary_step = \'10\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) phrase_boundary = \'., ?, !, U+2026\' phrase_boundary_step = \'10\'');
utilsApi.sql("CREATE TABLE products(title text, price float) phrase_boundary = '., ?, !, U+2026' phrase_boundary_step = '10'");
utilsApi.Sql("CREATE TABLE products(title text, price float) phrase_boundary = '., ?, !, U+2026' phrase_boundary_step = '10'");
table products {
phrase_boundary = ., ?, !, U+2026
phrase_boundary_step = 10
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
phrase_boundary_step = 100
Phrase boundary word position increment. Optional, default is 0.
On phrase boundary, current word position will be additionally incremented by this number.
CREATE TABLE products(title text, price float) phrase_boundary_step = '100' phrase_boundary = '., ?, !, U+2026'
POST /cli -d "
CREATE TABLE products(title text, price float) phrase_boundary_step = '100' phrase_boundary = '., ?, !, U+2026'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'phrase_boundary_step' => '100',
'phrase_boundary' => '., ?, !, U+2026'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) phrase_boundary_step = \'100\' phrase_boundary = \'., ?, !, U+2026\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) phrase_boundary_step = \'100\' phrase_boundary = \'., ?, !, U+2026\'');
utilsApi.sql("CREATE TABLE products(title text, price float) phrase_boundary_step = '100' phrase_boundary = '., ?, !, U+2026'");
utilsApi.Sql("CREATE TABLE products(title text, price float) phrase_boundary_step = '100' phrase_boundary = '., ?, !, U+2026'");
table products {
phrase_boundary_step = 100
phrase_boundary = ., ?, !, U+2026
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
# index '13"' as '13inch'
regexp_filter = \b(\d+)\" => \1inch
# index 'blue' or 'red' as 'color'
regexp_filter = (blue|red) => color
Regular expressions (regexps) used to filter the fields and queries. This directive is optional, multi-valued, and its default is an empty list of regular expressions. The regular expressions engine used by Manticore Search is Google's RE2, which is known for its speed and safety. For detailed information on the syntax supported by RE2, you can visit the RE2 syntax guide.
In certain applications such as product search, there can be many ways to refer to a product, model, or property. For example, iPhone 3gs and iPhone 3 gs (or even iPhone3 gs) are very likely to refer to the same product. Another example could be different ways to express a laptop screen size, such as 13-inch, 13 inch, 13", or 13in.
Regexps provide a mechanism to specify rules tailored to handle such cases. In the first example, you could possibly use a wordforms file to handle a handful of iPhone models, but in the second example, it's better to specify rules that would normalize "13-inch" and "13in" to something identical.
Regular expressions listed in regexp_filter are applied in the order they are listed, at the earliest stage possible, before any other processing (including exceptions), even before tokenization. That is, regexps are applied to the raw source fields when indexing, and to the raw search query text when searching.
CREATE TABLE products(title text, price float) regexp_filter = '(blue|red) => color'
POST /cli -d "
CREATE TABLE products(title text, price float) regexp_filter = '(blue|red) => color'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'regexp_filter' => '(blue|red) => color'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) regexp_filter = \'(blue|red) => color\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) regexp_filter = \'(blue|red) => color\'');
utilsApi.sql("CREATE TABLE products(title text, price float) regexp_filter = '(blue|red) => color'");
utilsApi.Sql("CREATE TABLE products(title text, price float) regexp_filter = '(blue|red) => color'");
table products {
# index '13"' as '13inch'
regexp_filter = \b(\d+)\" => \1inch
# index 'blue' or 'red' as 'color'
regexp_filter = (blue|red) => color
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
Wildcard searching is a common text search type. In Manticore, it is performed at the dictionary level. By default, both plain and RT tables use the keywords dictionary type (dict=keywords). In this mode, words are stored as they are, so enabling wildcarding does not affect the size of the table. When a wildcard search is performed, the dictionary is searched to find all possible expansions of the wildcarded word. This expansion can be computationally expensive at query time when the wildcarded word yields many expansions, or expansions with huge hitlists, especially in the case of infixes where the wildcard is added at both the start and end of the word. To avoid such problems, expansion_limit can be used.
min_prefix_len = length
This setting determines the minimum word prefix length to index and search. By default, it is set to 0, meaning prefixes are not allowed.
Prefixes allow wildcard searching with wordstart* patterns.
For example, if the word "example" is indexed with min_prefix_len=3, it can be found by searching for "exa", "exam", "examp", "exampl", as well as the full word.
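For instance, with min_prefix_len = '3' as in the examples below, a prefix wildcard query could look like this (illustrative):
SELECT * FROM products WHERE MATCH('exa*');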
Note that with dict=crc min_prefix_len will affect the size of the full-text index since each word expansion will be stored additionally.
Manticore can differentiate perfect word matches from prefix matches and rank the former higher if the following conditions are met: dict=keywords (on by default), index_exact_words=1 (off by default), and expand_keywords=1 (also off by default).
Note that with either dict=crc mode or any of the above options disabled, it is not possible to differentiate between prefixes and full words, and perfect word matches cannot be ranked higher.
When the minimum infix length is set to a positive number, the minimum prefix length is always considered 1.
CREATE TABLE products(title text, price float) min_prefix_len = '3'
POST /cli -d "
CREATE TABLE products(title text, price float) min_prefix_len = '3'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'min_prefix_len' => '3'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) min_prefix_len = \'3\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) min_prefix_len = \'3\'');
utilsApi.sql("CREATE TABLE products(title text, price float) min_prefix_len = '3'");
utilsApi.Sql("CREATE TABLE products(title text, price float) min_prefix_len = '3'");
table products {
min_prefix_len = 3
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
min_infix_len = length
The min_infix_len setting determines the minimum length of an infix prefix to index and search. It is optional and its default value is 0, which means that infixes are not allowed. The minimum allowed non-zero value is 2.
When enabled, infixes allow for wildcard searching with term patterns like start*, *end, *middle*, and so on. It also allows you to disallow wildcards that are too short, if they are too expensive to search for.
If the following conditions are met, Manticore can differentiate perfect word matches from infix matches and rank the former higher: dict=keywords (on by default), index_exact_words=1 (off by default), and expand_keywords=1 (also off by default).
Note that with the dict=crc mode or any of the above options disabled, there is no way to differentiate between infixes and full words, and thus perfect word matches cannot be ranked higher.
Infix wildcard search query time can vary greatly, depending on how many keywords the substring will actually expand to. Short and frequent syllables like *in* or *ti* might expand to way too many keywords, all of which would need to be matched and processed. Therefore, to generally enable substring searches, you would set min_infix_len to 2. To limit the impact from wildcard searches with too short wildcards, you might set it higher.
Infixes must be at least 2 characters long, and wildcards like *a* are not allowed for performance reasons.
When min_infix_len is set to a positive number, the minimum prefix length is always considered 1. With dict=keywords, word infixing and prefixing cannot both be enabled at the same time. With dict=crc, it is possible to have some fields use infixes (declared with infix_fields) while other fields use prefixes (declared with prefix_fields), but it is forbidden to declare the same field in both lists.
If dict=keywords, besides the wildcard * two other wildcard characters can be used:
- ? can match any single character: t?st will match test, but not teast
- % can match zero or one character: tes% will match tes or test, but not testing

CREATE TABLE products(title text, price float) min_infix_len = '3'
POST /cli -d "
CREATE TABLE products(title text, price float) min_infix_len = '3'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'min_infix_len' => '3'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) min_infix_len = \'3\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) min_infix_len = \'3\'');
utilsApi.sql("CREATE TABLE products(title text, price float) min_infix_len = '3'");
utilsApi.Sql("CREATE TABLE products(title text, price float) min_infix_len = '3'");
table products {
min_infix_len = 3
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
prefix_fields = field1[, field2, ...]
The prefix_fields setting is used to limit prefix indexing to specific full-text fields in dict=crc mode. By default, all fields are indexed in prefix mode, but because prefix indexing can affect both indexing and searching performance, it may be desired to limit it to certain fields.
To limit prefix indexing to specific fields, use the prefix_fields setting followed by a comma-separated list of field names. If prefix_fields is not set, then all fields will be indexed in prefix mode.
table products {
prefix_fields = title, name
min_prefix_len = 3
dict = crc
}
infix_fields = field1[, field2, ...]
The infix_fields setting allows you to specify a list of full-text fields to limit infix indexing to. This applies to dict=crc only and is optional, with the default being to index all fields in infix mode.
This setting is similar to prefix_fields, but instead allows you to limit infix indexing to specific fields.
table products {
infix_fields = title, name
min_infix_len = 3
dict = crc
}
max_substring_len = length
The max_substring_len directive sets the maximum substring length to be indexed for either prefix or infix searches. This setting is optional, and its default value is 0 (which means that all possible substrings are indexed). It only applies to dict=crc.
By default, substring indexing in dict=crc mode indexes all possible substrings as separate keywords, which can result in an overly large full-text index. Therefore, the max_substring_len directive allows you to skip too-long substrings that will probably never be searched for.
For example, a test table of 10,000 blog posts takes up a different amount of disk space depending on this setting.
Therefore, limiting the max substring length can save 10-15% of the table size.
When using dict=keywords mode, there is no performance impact associated with substring length. Therefore, this directive is not applicable and is intentionally forbidden in that case. However, if required, you can still limit the length of a substring that you search for in the application code.
table products {
max_substring_len = 12
min_infix_len = 3
dict = crc
}
expand_keywords = {0|1|exact|star}
This setting expands keywords with their exact forms and/or with stars when possible. The supported values are:
- 0 - do not expand keywords (default)
- 1 - expand keywords with both the star form and the exact form when possible. For instance, running will become (running | *running* | =running)
- exact - augment the keyword with only its exact form. For instance, running will become (running | =running)
- star - augment the keyword by adding * around it. For instance, running will become (running | *running*)

Queries against tables with the expand_keywords feature enabled are internally expanded as follows: if the table was built with prefix or infix indexing enabled, every keyword gets internally replaced with a disjunction of the keyword itself and a respective prefix or infix (keyword with stars). If the table was built with both stemming and index_exact_words enabled, the exact form is also added.
CREATE TABLE products(title text, price float) expand_keywords = '1'
POST /cli -d "
CREATE TABLE products(title text, price float) expand_keywords = '1'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'expand_keywords' => '1'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) expand_keywords = \'1\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) expand_keywords = \'1\'');
utilsApi.sql("CREATE TABLE products(title text, price float) expand_keywords = '1'");
utilsApi.Sql("CREATE TABLE products(title text, price float) expand_keywords = '1'");
table products {
expand_keywords = 1
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
Expanded queries take naturally longer to complete, but can possibly improve the search quality, as the documents with exact form matches should be ranked generally higher than documents with stemmed or infix matches.
Note that the existing query syntax does not allow emulating this kind of expansion, because internal expansion works on the keyword level and expands keywords within phrase or quorum operators too (which is not possible through the query syntax). Take a look at the examples below to see how expand_keywords affects the search result weights and how "runsy" is found by "runs" without the need to add a star:
mysql> create table t(f text) min_infix_len='2' expand_keywords='1' morphology='stem_en';
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> insert into t values(1,'running'),(2,'runs'),(3,'runsy');
Query OK, 3 rows affected (0.00 sec)
mysql> select *, weight() from t where match('runs');
+------+---------+----------+
| id | f | weight() |
+------+---------+----------+
| 2 | runs | 1560 |
| 1 | running | 1500 |
| 3 | runsy | 1500 |
+------+---------+----------+
3 rows in set (0.01 sec)
mysql> drop table t;
Query OK, 0 rows affected (0.01 sec)
mysql> create table t(f text) min_infix_len='2' expand_keywords='exact' morphology='stem_en';
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> insert into t values(1,'running'),(2,'runs'),(3,'runsy');
Query OK, 3 rows affected (0.00 sec)
mysql> select *, weight() from t where match('running');
+------+---------+----------+
| id | f | weight() |
+------+---------+----------+
| 1 | running | 1590 |
| 2 | runs | 1500 |
+------+---------+----------+
2 rows in set (0.00 sec)
mysql> create table t(f text) min_infix_len='2' morphology='stem_en';
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> insert into t values(1,'running'),(2,'runs'),(3,'runsy');
Query OK, 3 rows affected (0.00 sec)
mysql> select *, weight() from t where match('runs');
+------+---------+----------+
| id | f | weight() |
+------+---------+----------+
| 1 | running | 1500 |
| 2 | runs | 1500 |
+------+---------+----------+
2 rows in set (0.00 sec)
mysql> drop table t;
Query OK, 0 rows affected (0.01 sec)
mysql> create table t(f text) min_infix_len='2' morphology='stem_en';
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> insert into t values(1,'running'),(2,'runs'),(3,'runsy');
Query OK, 3 rows affected (0.00 sec)
mysql> select *, weight() from t where match('running');
+------+---------+----------+
| id | f | weight() |
+------+---------+----------+
| 1 | running | 1500 |
| 2 | runs | 1500 |
+------+---------+----------+
2 rows in set (0.00 sec)
This directive does not affect indexer in any way; it only affects searchd.
expansion_limit = number
Maximum number of expanded keywords for a single wildcard. Details are here.
Stop words are words that are ignored during indexing and searching, typically due to their high frequency and low value to search results.
Manticore Search applies stemming to stop words by default, which can lead to undesired results, but this can be turned off using the stopwords_unstemmed directive.
Small stop word files are stored in the table header, and there is a limit to the size of files that can be embedded, as defined by the embedded_limit option.
Stop words are not indexed, but they do affect keyword positions. For example, if "the" is a stop word, and document 1 contains the phrase "in office" while document 2 contains the phrase "in the office," searching for "in office" as an exact phrase will only return the first document, even though "the" is skipped as a stop word in the second document. This behavior can be modified using the stopword_step directive.
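For instance, assuming "the" is in the stop words list and stopword_step is at its default value of 1, the following phrase query would match a document containing "in office" but not one containing "in the office" (the table and data are illustrative):
SELECT * FROM products WHERE MATCH('"in office"');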
stopwords=path/to/stopwords/file[ path/to/another/file ...]
The stopwords setting is optional and by default empty. It allows you to specify the path to one or more stop word files, separated by spaces. All the files will be loaded. In the real-time mode, only absolute paths are allowed.
The stop word file format is simple plain text with UTF-8 encoding. The file data will be tokenized with respect to the charset_table settings, so you can use the same separators as in the indexed data.
Stop word files can be created manually or semi-automatically. The indexer provides a mode that creates a frequency dictionary of the table, sorted by keyword frequency. Top keywords from that dictionary can usually be used as stop words. See the --buildstops and --buildfreqs switches for details.
CREATE TABLE products(title text, price float) stopwords = '/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt'
POST /cli -d "
CREATE TABLE products(title text, price float) stopwords = '/usr/local/manticore/data/stopwords.txt stopwords-ru.txt stopwords-en.txt'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'stopwords' => '/usr/local/manticore/data/stopwords.txt stopwords-ru.txt stopwords-en.txt'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt\'');
utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = '/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt'");
utilsApi.Sql("CREATE TABLE products(title text, price float) stopwords = '/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt'");
table products {
stopwords = /usr/local/manticore/data/stopwords.txt
stopwords = stopwords-ru.txt stopwords-en.txt
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
Alternatively you can use one of the default stop word files that come with Manticore. Currently stop words for 50 languages are available. Here is the full list of aliases for them:
For example, to use stop words for the Italian language, just use the following setting:
CREATE TABLE products(title text, price float) stopwords = 'it'
POST /cli -d "
CREATE TABLE products(title text, price float) stopwords = 'it'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'stopwords' => 'it'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'it\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'it\'');
utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = 'it'");
utilsApi.Sql("CREATE TABLE products(title text, price float) stopwords = 'it'");
table products {
stopwords = it
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
If you need to use stop words for multiple languages you should list all their aliases, separated with commas (RT mode) or spaces (plain mode):
CREATE TABLE products(title text, price float) stopwords = 'en, it, ru'
POST /cli -d "
CREATE TABLE products(title text, price float) stopwords = 'en, it, ru'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'stopwords' => 'en, it, ru'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en, it, ru\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en, it, ru\'');
utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = 'en, it, ru'");
utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = 'en, it, ru'");
table products {
stopwords = en it ru
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
stopword_step={0|1}
Position increment on stop words. Optional, allowed values are 0 and 1, and the default is 1.
CREATE TABLE products(title text, price float) stopwords = 'en' stopword_step = '1'
POST /cli -d "
CREATE TABLE products(title text, price float) stopwords = 'en' stopword_step = '1'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'stopwords' => 'en, it, ru',
'stopword_step' => '1'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en\' stopword_step = \'1\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en\' stopword_step = \'1\'');
utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = \'en\' stopword_step = \'1\'");
utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = \'en\' stopword_step = \'1\'");
table products {
stopwords = en
stopword_step = 1
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
stopwords_unstemmed={0|1}
Whether to apply stop words before or after stemming. Optional, default is 0 (apply stop word filter after stemming).
By default, stop words are stemmed themselves, and then applied to tokens after stemming (or any other morphology processing). This means that a token is stopped when stem(token) is equal to stem(stopword). This default behavior can lead to unexpected results when a token is erroneously stemmed to a stopped root. For example, "Andes" might get stemmed to "and", so when "and" is a stopword, "Andes" is also skipped.
However, you can change this behavior by enabling the stopwords_unstemmed directive. When this is enabled, stop words are applied before stemming (and therefore to the original word forms), and the tokens are skipped when the token is equal to the stopword.
CREATE TABLE products(title text, price float) stopwords = 'en' stopwords_unstemmed = '1'
POST /cli -d "
CREATE TABLE products(title text, price float) stopwords = 'en' stopwords_unstemmed = '1'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'stopwords' => 'en, it, ru',
'stopwords_unstemmed' => '1'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en\' stopwords_unstemmed = \'1\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en\' stopwords_unstemmed = \'1\'');
utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = \'en\' stopwords_unstemmed = \'1\'");
utilsApi.Sql("CREATE TABLE products(title text, price float) stopwords = \'en\' stopwords_unstemmed = \'1\'");
table products {
stopwords = en
stopwords_unstemmed = 1
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
Word forms are applied after tokenizing incoming text by charset_table rules. They essentially let you replace one word with another. Normally, that would be used to bring different word forms to a single normal form (e.g. to normalize all the variants such as "walks", "walked", "walking" to the normal form "walk"). It can also be used to implement stemming exceptions, because stemming is not applied to words found in the forms list.
wordforms = path/to/wordforms.txt
wordforms = path/to/alternateforms.txt
wordforms = path/to/dict*.txt
Word forms dictionary. Optional, default is empty.
The dictionaries are used to normalize incoming words both during indexing and searching. Therefore, when it comes to a plain table, it's required to rotate the table in order to pick up changes in the word forms file.
Word forms support in Manticore is designed to handle large dictionaries well. Large dictionaries moderately affect indexing speed; for example, a dictionary with 1 million entries slows down indexing by about 1.5 times. Searching speed is not affected at all. The additional RAM impact is roughly equal to the dictionary file size, and dictionaries are shared across tables. For instance, if the very same 50 MB word forms file is specified for 10 different tables, the additional searchd RAM usage will be about 50 MB.
The dictionary file should be in a simple plain text format. Each line should contain source and destination word forms, in UTF-8 encoding, separated by a "greater than" sign. Rules from the charset_table will be applied when the file is loaded, so if you are using built-in charset_table options, it is typically case-insensitive, just like your other full-text indexed data. Here is a sample file contents:
walks > walk
walked > walk
walking > walk
There is a bundled utility called Spelldump that helps you create a dictionary file in a format that Manticore can read. The utility can read from source .dict and .aff dictionary files in the ispell or MySpell format, as bundled with OpenOffice.
You can map several source words to a single destination word. The process happens on tokens, not the source text, so differences in whitespace and markup are ignored.
You can use the => symbol instead of >. Comments (starting with #) are also allowed. Finally, if a line starts with a tilde (~), the wordform will be applied after morphology, instead of before (note that only a single source and destination word are supported in this case).
core 2 duo > c2d
e6600 > c2d
core 2duo => c2d # Some people write '2duo' together...
~run > walk # Along with stem_en morphology enabled replaces 'run', 'running', 'runs' (and any other words that stem to just 'run') to 'walk'
You can specify multiple destination tokens:
s02e02 > season 2 episode 2
s3 e3 > season 3 episode 3
You can specify multiple files, not just one. Masks can be used as a pattern, and all matching files will be processed in simple ascending order.
In the RT mode, only absolute paths are allowed.
If multi-byte codepages are used and file names include foreign characters, the resulting order may not be exactly alphabetic. If the same wordform definition is found in multiple files, the latter one is used and overrides previous definitions.
CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt /var/lib/manticore/dict*.txt'
POST /cli -d "
CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'wordforms' => [
'/var/lib/manticore/wordforms.txt',
'/var/lib/manticore/alternateforms.txt',
'/var/lib/manticore/dict*.txt'
]
]);
utilsApi.sql('CREATE TABLE products(title text, price float) wordforms = \'/var/lib/manticore/wordforms.txt\' wordforms = \'/var/lib/manticore/alternateforms.txt\' wordforms = \'/var/lib/manticore/dict*.txt\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) wordforms = \'/var/lib/manticore/wordforms.txt\' wordforms = \'/var/lib/manticore/alternateforms.txt\' wordforms = \'/var/lib/manticore/dict*.txt\'');
utilsApi.sql("CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'");
utilsApi.Sql("CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'");
table products {
wordforms = /var/lib/manticore/wordforms.txt
wordforms = /var/lib/manticore/alternateforms.txt
wordforms = /var/lib/manticore/dict*.txt
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
Exceptions (also known as synonyms) allow mapping one or more tokens (including tokens with characters that would normally be excluded) to a single keyword. They are similar to wordforms in that they also perform mapping but have a number of important differences.
A short summary of the differences from wordforms is as follows:
| Exceptions | Word forms |
|---|---|
| Case sensitive | Case insensitive |
| Can use special characters that are not in charset_table | Fully obey charset_table |
| Underperform on huge dictionaries | Designed to handle millions of entries |
exceptions = path/to/exceptions.txt
Tokenizing exceptions file. Optional, the default is empty.
In the RT mode, only absolute paths are allowed.
The expected file format is plain text, with one line per exception. The line format is as follows:
map-from-tokens => map-to-token
Example file:
at & t => at&t
AT&T => AT&T
Standarten Fuehrer => standartenfuhrer
Standarten Fuhrer => standartenfuhrer
MS Windows => ms windows
Microsoft Windows => ms windows
C++ => cplusplus
c++ => cplusplus
C plus plus => cplusplus
All tokens here are case sensitive and will not be processed by charset_table rules. Thus, with the example exceptions file above, the "at&t" text will be tokenized as two keywords "at" and "t" due to lowercase letters. On the other hand, "AT&T" will match exactly and produce a single "AT&T" keyword.
Note that this map-to keyword is always interpreted as a single word and is both case and space sensitive.
In our sample, "ms windows" query will not match the document with "MS Windows" text. The query will be interpreted as a query for two keywords, "ms" and "windows". The mapping for "MS Windows" is a single keyword "ms windows", with a space in the middle. On the other hand, "standartenfuhrer" will retrieve documents with "Standarten Fuhrer" or "Standarten Fuehrer" contents (capitalized exactly like this), or any capitalization variant of the keyword itself, e.g., "staNdarTenfUhreR". (It won't catch "standarten fuhrer", however: this text does not match any of the listed exceptions because of case sensitivity and gets indexed as two separate keywords.)
The whitespace in the map-from tokens list matters, but its amount does not. Any amount of whitespace in the map-from list will match any other amount of whitespace in the indexed document or query. For instance, the "AT & T" map-from token will match "AT    &  T" text, whatever the amount of space in both the map-from part and the indexed text. Such text will, therefore, be indexed as a special "AT&T" keyword, thanks to the very first entry from the sample.
Exceptions also allow capturing special characters (that are exceptions from general charset_table rules; hence the name). Assume that you generally do not want to treat '+' as a valid character, but still want to be able to search for some exceptions from this rule such as 'C++'. The sample above will do just that, totally independent of what characters are in the table and what are not.
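For example, with the sample exceptions file above configured for the table, a query containing 'C++' is mapped to the single cplusplus keyword and will match documents that contained any of the listed variants (illustrative):
SELECT * FROM products WHERE MATCH('C++');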
As with word forms, when it comes to a plain table, it's required to rotate the table in order to pick up changes in the exceptions file.
CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'
POST /cli -d "
CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'exceptions' => '/usr/local/manticore/data/exceptions.txt'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) exceptions = \'/usr/local/manticore/data/exceptions.txt\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) exceptions = \'/usr/local/manticore/data/exceptions.txt\'');
utilsApi.sql("CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'");
utilsApi.Sql("CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'");
table products {
exceptions = /usr/local/manticore/data/exceptions.txt
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
Morphology preprocessors can be applied to words during indexing to normalize different forms of the same word and improve segmentation. For example, an English stemmer can normalize "dogs" and "dog" to "dog", resulting in identical search results for both keywords.
Manticore has four types of built-in morphology preprocessors:
- Lemmatizers, which reduce a word to its dictionary base form (lemma)
- Stemmers, which reduce a word to its root form using rules
- Phonetic algorithms (Soundex and Metaphone), which replace words with phonetic codes that match even when the spellings differ
- Word breaking (segmentation) for Chinese
morphology = morphology1[, morphology2, ...]
The morphology directive specifies a list of morphology preprocessors to apply to the words being indexed. This is an optional setting, with the default being no preprocessor applied.
Manticore comes with built-in morphological preprocessors for:
Lemmatizers require dictionary .pak files that can be downloaded from the Manticore website. The dictionaries need to be put in the directory specified by lemmatizer_base. Additionally, the lemmatizer_cache setting can be used to speed up lemmatizing by spending more RAM for an uncompressed dictionary cache.
The Chinese language segmentation can be performed using ICU. It provides more precise segmentation compared to n-grams but is slightly slower. The charset_table must include all Chinese characters, which can be done by using the "cjk" alias. When "morphology=icu_chinese" is specified, the documents are first pre-processed by ICU. Then, the result is processed by the tokenizer according to the charset_table, and finally, other morphology processors specified in the "morphology" option are applied. Only those parts of texts that contain Chinese are passed to ICU for segmentation, while others can be modified by different means such as different morphologies or charset_table.
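A sketch of enabling ICU-based Chinese segmentation, using the charset_table aliases mentioned above (the table schema is illustrative):
CREATE TABLE products(title text, price float) charset_table = 'non_cjk,cjk' morphology = 'icu_chinese'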
Built-in English and Russian stemmers are faster than their libstemmer counterparts but may produce slightly different results.
Soundex implementation matches that of MySQL. Metaphone implementation is based on Double Metaphone algorithm and indexes the primary code.
To use the morphology option, specify one or more of the built-in options, including:
The Ukrainian lemmatizer requires the Ukrainian-specific characters to be present in the charset_table, since by default they are not. To add them, override the charset_table like this: charset_table='non_cjk,U+0406->U+0456,U+0456,U+0407->U+0457,U+0457,U+0490->U+0491,U+0491'. There is an interactive course about how to install and use the Ukrainian lemmatizer. Multiple stemmers can be specified, separated by commas. They will be applied to incoming words in the order they are listed, and processing will stop once one of the stemmers modifies the word. Additionally, when the wordforms feature is enabled, the word will be looked up in the word forms dictionary first. If there is a matching entry in the dictionary, stemmers will not be applied at all. wordforms can be used to implement stemming exceptions.
CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'
POST /cli -d "CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'morphology' => 'stem_en, libstemmer_sv'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'stem_en, libstemmer_sv\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'stem_en, libstemmer_sv\'');
utilsApi.sql("CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'");
utilsApi.Sql("CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'");
table products {
morphology = stem_en, libstemmer_sv
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
morphology_skip_fields = field1[, field2, ...]
A list of fields to skip morphology preprocessing. Optional, default is empty (apply preprocessors to all fields).
CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'
POST /cli -d "
CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'name'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'morphology_skip_fields' => 'name',
'morphology' => 'stem_en'
]);
utilsApi.sql('CREATE TABLE products(title text, name text, price float) morphology_skip_fields = \'name\' morphology = \'stem_en\'')
res = await utilsApi.sql('CREATE TABLE products(title text, name text, price float) morphology_skip_fields = \'name\' morphology = \'stem_en\'');
utilsApi.sql("CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'");
utilsApi.Sql("CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'");
table products {
morphology_skip_fields = name
morphology = stem_en
type = rt
path = tbl
rt_field = title
rt_field = name
rt_attr_uint = price
}
min_stemming_len = length
Minimum word length at which to enable stemming. Optional, default is 1 (stem everything).
Stemmers are not perfect and might sometimes produce undesired results. For instance, running the "gps" keyword through the Porter stemmer for English results in "gp", which is not really the intent. The min_stemming_len feature lets you suppress stemming based on the source word length, i.e., to avoid stemming words that are too short. Keywords that are shorter than the given threshold will not be stemmed. Note that keywords that are exactly as long as specified will be stemmed. So in order to avoid stemming 3-character keywords, you should specify 4 for the value. For finer-grained control, refer to the wordforms feature.
CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'
POST /cli -d "
CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'min_stemming_len' => '4',
'morphology' => 'stem_en'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) min_stemming_len = \'4\' morphology = \'stem_en\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) min_stemming_len = \'4\' morphology = \'stem_en\'');
utilsApi.sql("CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'");
utilsApi.Sql("CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'");
table products {
min_stemming_len = 4
morphology = stem_en
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
index_exact_words = {0|1}
This option allows for the indexing of original keywords along with their morphologically modified versions. However, original keywords that are remapped by the wordforms and exceptions cannot be indexed. The default value is 0, indicating that this feature is disabled by default.
This allows the use of the exact form operator in the query language. Enabling this feature will increase the full-text index size and indexing time, but will not impact search performance.
CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'
POST /cli -d "
CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'index_exact_words' => '1',
'morphology' => 'stem_en'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) index_exact_words = \'1\' morphology = \'stem_en\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_exact_words = \'1\' morphology = \'stem_en\'');
utilsApi.sql("CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'");
utilsApi.Sql("CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'");
table products {
index_exact_words = 1
morphology = stem_en
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
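For example, with index_exact_words enabled on such a table, the exact form operator matches the original token rather than its stemmed form:
SELECT * FROM products WHERE MATCH('=dogs');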
html_strip = {0|1}
This option determines whether HTML markup should be stripped from the incoming full-text data. The default value is 0, which disables stripping. To enable stripping, set the value to 1.
HTML tags and entities are considered as markup and will be processed.
HTML tags are removed, while the contents between them (e.g. everything between <p> and </p>) are left intact. You can choose to keep and index tag attributes (e.g. HREF attribute in an A tag or ALT in an IMG tag). Some well-known inline tags, such as A, B, I, S, U, BASEFONT, BIG, EM, FONT, IMG, LABEL, SMALL, SPAN, STRIKE, STRONG, SUB, SUP, and TT, are completely removed. All other tags are treated as block level and are replaced with whitespace. For example, the text te<b>st</b> will be indexed as a single keyword 'test', while te<p>st</p> will be indexed as two keywords 'te' and 'st'.
HTML entities are decoded and replaced with their corresponding UTF-8 characters. The stripper supports both numeric forms (e.g. &#239;) and text forms (e.g. &oacute; or &nbsp;) of entities, and supports all entities specified by the HTML4 standard.
The stripper is designed to work with properly formed HTML and XHTML, but may produce unexpected results on malformed input (such as HTML with stray <'s or unclosed >'s).
Please note that only the tags themselves, as well as HTML comments, are stripped. To strip the contents of the tags, including embedded scripts, see the html_remove_elements option. There are no restrictions on tag names, meaning that everything that looks like a valid tag start, end, or comment will be stripped.
CREATE TABLE products(title text, price float) html_strip = '1'
POST /cli -d "
CREATE TABLE products(title text, price float) html_strip = '1'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'html_strip' => '1'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) html_strip = \'1\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) html_strip = \'1\'');
utilsApi.sql("CREATE TABLE products(title text, price float) html_strip = '1'");
utilsApi.Sql("CREATE TABLE products(title text, price float) html_strip = '1'");
table products {
html_strip = 1
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
html_index_attrs = img=alt,title; a=title;
The html_index_attrs option allows you to specify which HTML markup attributes should be indexed even though other HTML markup is stripped. The default value is empty, meaning no attributes will be indexed.
The format of the option is a per-tag enumeration of indexable attributes, as demonstrated in the example above. The contents of the specified attributes will be retained and indexed, providing a way to extract additional information from your full-text data.
CREATE TABLE products(title text, price float) html_index_attrs = 'img=alt,title; a=title;' html_strip = '1'
POST /cli -d "
CREATE TABLE products(title text, price float) html_index_attrs = 'img=alt,title; a=title;' html_strip = '1'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'html_index_attrs' => 'img=alt,title; a=title;',
'html_strip' => '1'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) html_index_attrs = \'img=alt,title; a=title;\' html_strip = \'1\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) html_index_attrs = \'img=alt,title; a=title;\' html_strip = \'1\'');
utilsApi.sql("CREATE TABLE products(title text, price float) html_index_attrs = \'img=alt,title; a=title;\' html_strip = '1'");
utilsApi.Sql("CREATE TABLE products(title text, price float) html_index_attrs = \'img=alt,title; a=title;\' html_strip = '1'");
table products {
html_index_attrs = img=alt,title; a=title;
html_strip = 1
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
html_remove_elements = element1[, element2, ...]
A list of HTML elements whose contents, along with the elements themselves, will be stripped. Optional, the default is an empty string (do not strip contents of any elements).
This option allows you to remove the contents of elements, meaning everything between the opening and closing tags. It is useful for removing embedded scripts, CSS, etc. The short tag form for empty elements (e.g. <br />) is properly supported, and the text following such a tag will not be removed.
The value is a comma-separated list of element (tag) names, the contents of which should be removed. Tag names are case-insensitive.
CREATE TABLE products(title text, price float) html_remove_elements = 'style, script' html_strip = '1'
POST /cli -d "
CREATE TABLE products(title text, price float) html_remove_elements = 'style, script' html_strip = '1'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'html_remove_elements' => 'style, script',
'html_strip' => '1'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) html_remove_elements = \'style, script\' html_strip = \'1\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) html_remove_elements = \'style, script\' html_strip = \'1\'');
utilsApi.sql("CREATE TABLE products(title text, price float) html_remove_elements = \'style, script\' html_strip = '1'");
utilsApi.Sql("CREATE TABLE products(title text, price float) html_remove_elements = \'style, script\' html_strip = '1'");
table products {
html_remove_elements = style, script
html_strip = 1
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
index_sp = {0|1}
Controls detection and indexing of sentence and paragraph boundaries. Optional, default is 0 (no detection or indexing).
This directive enables the detection and indexing of sentence and paragraph boundaries, making it possible for the SENTENCE and PARAGRAPH operators to work. Sentence boundary detection is based on plain text analysis and only requires setting index_sp = 1 to enable it. Paragraph detection, however, relies on HTML markup and occurs during the HTML stripping process (see html_strip). As such, to index paragraph boundaries, both the index_sp directive and the html_strip directive must be set to 1.
The following rules are used to determine sentence boundaries:
Paragraph boundaries are detected at every block-level HTML tag, including: ADDRESS, BLOCKQUOTE, CAPTION, CENTER, DD, DIV, DL, DT, H1, H2, H3, H4, H5, LI, MENU, OL, P, PRE, TABLE, TBODY, TD, TFOOT, TH, THEAD, TR, and UL.
Both sentences and paragraphs increment the keyword position counter by 1.
CREATE TABLE products(title text, price float) index_sp = '1' html_strip = '1'
POST /cli -d "
CREATE TABLE products(title text, price float) index_sp = '1' html_strip = '1'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'index_sp' => '1',
'html_strip' => '1'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) index_sp = \'1\' html_strip = \'1\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_sp = \'1\' html_strip = \'1\'');
utilsApi.sql("CREATE TABLE products(title text, price float) index_sp = \'1\' html_strip = '1'");
utilsApi.Sql("CREATE TABLE products(title text, price float) index_sp = \'1\' html_strip = '1'");
table products {
index_sp = 1
html_strip = 1
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
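For example, with index_sp enabled, queries can require keywords to occur within the same sentence or paragraph (a sketch; the keywords are illustrative):
SELECT * FROM products WHERE MATCH('"smartphone case" SENTENCE leather');
SELECT * FROM products WHERE MATCH('warranty PARAGRAPH "free returns"');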
index_zones = h*, th, title
A list of HTML/XML zones within a field to be indexed. The default is an empty string (no zones will be indexed).
A "zone" is defined as everything between an opening and a matching closing tag, and all spans sharing the same tag name are referred to as a "zone." For example, everything between <H1> and </H1> in a document field belongs to the H1 zone.
The index_zones directive enables zone indexing, but the HTML stripper must also be enabled (by setting html_strip = 1). The value of index_zones should be a comma-separated list of tag names and wildcards (ending with a star) to be indexed as zones.
Zones can be nested and overlap, as long as every opening tag has a matching closing tag. Zones can also be used for matching with the ZONE operator, as described in the extended query syntax.
CREATE TABLE products(title text, price float) index_zones = 'h*, th, title' html_strip = '1'
POST /cli -d "
CREATE TABLE products(title text, price float) index_zones = 'h, th, title' html_strip = '1'"
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'index_zones' => 'h*,th,title',
'html_strip' => '1'
]);
utilsApi.sql('CREATE TABLE products(title text, price float) index_zones = \'h*, th, title\' html_strip = \'1\'')
res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_zones = \'h*, th, title\' html_strip = \'1\'');
utilsApi.sql("CREATE TABLE products(title text, price float) index_zones = 'h*, th, title' html_strip = '1'");
utilsApi.Sql("CREATE TABLE products(title text, price float) index_zones = 'h*, th, title' html_strip = '1'");
table products {
index_zones = h*, th, title
html_strip = 1
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
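For example, with zone indexing enabled as above, matching can be limited to specific zones using the ZONE operator (the keywords are illustrative):
SELECT * FROM products WHERE MATCH('ZONE:(h1,h2) black mug');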
Manticore allows for the creation of distributed tables, which act like regular plain or real-time tables, but are actually a collection of child tables used for searching. When a query is sent to a distributed table, it is distributed among all tables in the collection. The server then collects and processes the responses to sort and recalculate values of aggregates, if necessary.
From the client's perspective, it appears as if they are querying a single table.
Distributed tables can be composed of any combination of tables, including:
Mixing percolate and template tables with plain and real-time tables is not recommended.
A distributed table is defined with type 'distributed' in the configuration file or via the CREATE TABLE SQL statement:
table foo {
type = distributed
local = bar
local = bar1, bar2
agent = 127.0.0.1:9312:baz
agent = host1|host2:tbl
agent = host1:9301:tbl1|host2:tbl2 [ha_strategy=random retry_count=10]
...
}
CREATE TABLE distributed_index type='distributed' local='local_index' agent='127.0.0.1:9312:remote_index'
The essence of a distributed table lies in its list of child tables, to which it points. There are two types of child tables in a distributed table:
Local tables: These are tables that are served within the same server as the distributed table. To enumerate local tables, you use the syntax local =. You can list several local tables using multiple local = lines, or combine them into one list separated by commas.
Remote tables: These are tables that are served anywhere outside the server. To enumerate remote tables, you use the syntax agent =. Each line represents one endpoint or agent. Each agent can have multiple external locations and options for how it should work. More details here. It is important to note that the server does not have any information about the type of table it is working with. This may lead to errors if, for example, you issue a CALL PQ to a remote table 'foo' that is not a percolate table.
A distributed table in Manticore Search acts as a "master node" that proxies the demanded query to other tables and provides merged results from the responses it receives. The table doesn't hold any data on its own. It can connect to both local tables and tables located on other servers. Here's an example of a simple distributed table:
table index_dist {
type = distributed
local = index1
local = index2
...
}
CREATE TABLE local_dist type='distributed' local='index1' local='index2';
$params = [
'body' => [
'settings' => [
'type' => 'distributed',
'local' => [
'index1',
'index2'
]
]
],
'index' => 'products'
];
$index = new \Manticoresearch\Index($client);
$index->create($params);
utilsApi.sql('CREATE TABLE local_dist type=\'distributed\' local=\'index1\' local=\'index2\'')
res = await utilsApi.sql('CREATE TABLE local_dist type=\'distributed\' local=\'index1\' local=\'index2\'');
utilsApi.sql("CREATE TABLE local_dist type='distributed' local='index1' local='index2'");
utilsApi.Sql("CREATE TABLE local_dist type='distributed' local='index1' local='index2'");
A remote table in Manticore Search is represented by the agent prefix in the definition of a distributed table. A distributed table can include a combination of local and remote tables. If there are no local tables provided, the distributed table will be purely remote and serve as a proxy only. For example, you might have a Manticore instance that listens on multiple ports and serves different protocols, and then redirects queries to backend servers that only accept connections via Manticore's internal binary protocol, using persistent connections to reduce the overhead of establishing connections.
Even though a purely remote distributed table doesn't serve local tables itself, it still consumes machine resources, as it still needs to perform final calculations, such as merging results and calculating final aggregated values.
agent = address1 [ | address2 [...] ][:table-list]
agent = address1[:table-list [ | address2[:table-list [...] ] ] ]
The agent directive declares the remote agents that are searched each time the enclosing distributed table is searched. These agents are essentially pointers to networked tables. The value specified includes the address and can also include multiple alternatives (agent mirrors) for either the address only or the address and table list.
The address specification must be one of the following:
address = hostname[:port] # eg. server2:9312
address = /absolute/unix/socket/path # eg. /var/run/manticore2.sock
The hostname is the remote host name, port is the remote TCP port number, table-list is a comma-separated list of table names, and square brackets [] indicate an optional clause.
If the table name is omitted, it is assumed to be the same table as the one where this line is defined. In other words, when defining agents for the 'mycoolindex' distributed table, you can simply point to the address, and it will be assumed that you are querying the mycoolindex table on the agent's endpoints.
If the port number is omitted, it is assumed to be 9312. If it is defined but invalid (e.g. 70000), the agent will be skipped.
You can point each agent to one or more remote tables residing on one or more networked servers with no restrictions. This enables several different usage modes:
All agents are searched in parallel. The index list is passed verbatim to the remote agent. The exact way that list is searched within the agent (i.e. sequentially or in parallel) depends solely on the agent's configuration (see the threads setting). The master has no remote control over this.
It is important to note that the LIMIT option is ignored in agent queries. This is because each agent can contain different tables, so it is the client's responsibility to apply the limit to the final result set. This is why the query to a physical table differs from the query to a distributed table when viewed in the query logs. The query cannot be a simple copy of the original query, as this would not produce the correct results.
For example, if a client makes a query SELECT ... LIMIT 10, 10, and there are two agents, with the second agent having only 10 documents, broadcasting the original LIMIT 10, 10 query would result in receiving 0 documents from the second agent. However, LIMIT 10,10 should return documents 10-20 from the resulting set. To resolve this, the query must be sent to the agents with a broader limit, such as the default max_matches value of 1000.
For instance, if there is a distributed table dist that refers to the remote table user, a client query SELECT * FROM dist LIMIT 10,10 would be converted to SELECT * FROM user LIMIT 0,1000 and sent to the remote table user. Once the distributed table receives the result, it will apply the LIMIT 10,10 and return the requested 10 documents.
SELECT * FROM dist LIMIT 10,10;
the query will be converted to:
SELECT * FROM user LIMIT 0,1000
Additionally, the value can specify options for each individual agent, such as:
ha_strategy - random, roundrobin, nodeads, noerrors (overrides the global ha_strategy setting for the particular agent)
conn - pconn, persistent (equivalent to setting agent_persistent at the table level)
blackhole - 0, 1 (identical to the agent_blackhole setting for the agent)
retry_count - an integer value (corresponding to agent_retry_count, but the provided value will not be multiplied by the number of mirrors)
agent = address1:table-list[[ha_strategy=value, conn=value, blackhole=value]]
Example:
# config on box1
# sharding a table over 3 servers
agent = box2:9312:shard1
agent = box3:9312:shard2
# config on box2
# sharding a table over 3 servers
agent = box1:9312:shard2
agent = box3:9312:shard3
# config on box3
# sharding a table over 3 servers
agent = box1:9312:shard1
agent = box2:9312:shard3
# per agent options
agent = box1:9312:shard1[ha_strategy=nodeads]
agent = box2:9312:shard2[conn=pconn]
agent = box2:9312:shard2[conn=pconn,ha_strategy=nodeads]
agent = test:9312:any[blackhole=1]
agent = test:9312|box2:9312|box3:9312:any2[retry_count=2]
agent = test:9312|box2:9312:any2[retry_count=2,conn=pconn,ha_strategy=noerrors]
For optimal performance, it's recommended to place remote tables that reside on the same server within the same record. For instance, instead of:
agent = remote:9312:idx1
agent = remote:9312:idx2
you should prefer:
agent = remote:9312:idx1,idx2
agent_persistent = remotebox:9312:index2
The agent_persistent option allows you to persistently connect to an agent, meaning the connection will not be dropped after a query is executed. The syntax for this directive is the same as the agent directive. However, instead of opening a new connection to the agent for each query and then closing it, the master will keep a connection open and reuse it for subsequent queries. The maximum number of persistent connections per agent host is defined by the persistent_connections_limit option in the searchd section.
It's important to note that the persistent_connections_limit must be set to a value greater than 0 in order to use persistent agent connections. If it's not defined, it defaults to 0, and the agent_persistent directive will act the same as the agent directive.
Using persistent master-agent connections reduces TCP port pressure and saves time on connection handshakes, making it more efficient.
agent_blackhole = testbox:9312:testindex1,testindex2
The agent_blackhole directive allows you to forward queries to remote agents without waiting for or processing their responses. This is useful for debugging or testing production clusters, as you can set up a separate debugging/testing instance and forward requests to it from the production master (aggregator) instance, without interfering with production work. The master searchd will attempt to connect to the blackhole agent and send queries as normal, but will not wait for or process any responses, and all network errors on the blackhole agents will be ignored. The format of the value is identical to that of the regular agent directive.
agent_connect_timeout = 300
The agent_connect_timeout directive defines the timeout for connecting to remote agents. By default, the value is assumed to be in milliseconds, but it can have another time suffix. The default value is 1000 (1 second).
When connecting to remote agents, searchd will wait for this amount of time at most to complete the connection successfully. If the timeout is reached but the connection has not been established, and retries are enabled, a retry will be initiated.
agent_query_timeout = 10000 # our query can be long, allow up to 10 sec
The agent_query_timeout sets the amount of time that searchd will wait for a remote agent to complete a query. The default value is 3000 milliseconds (3 seconds), but can be suffixed to indicate a different unit of time.
After establishing a connection, searchd will wait for a maximum of agent_query_timeout for remote queries to complete. Note that this timeout is separate from agent_connect_timeout, and the total possible delay caused by a remote agent will be the sum of both values. If the agent_query_timeout is reached, the query will not be retried; instead, a warning will be produced.
Note that the behavior is also affected by reset_network_timeout_on_packet.
The agent_retry_count is an integer that specifies how many times Manticore will attempt to connect and query remote agents in a distributed table before reporting a fatal query error. It works similarly to agent_retry_count defined in the "searchd" section of the configuration file but applies specifically to the table.
mirror_retry_count serves the same purpose as agent_retry_count. If both values are provided, mirror_retry_count will take precedence, and a warning will be raised.
The following options manage the overall behavior of remote agents and are specified in the searchd section of the configuration file. They set default values for the entire Manticore instance.
agent_connect_timeout - default value for the agent_connect_timeout parameter.
agent_query_timeout - default value for the agent_query_timeout parameter. This can also be overridden on a per-query basis using the same setting name in a distributed (network) table.
agent_retry_count - an integer that specifies the number of times Manticore will attempt to connect and query remote agents in a distributed table before reporting a fatal query error. The default value is 0 (i.e. no retries). This value can also be specified on a per-query basis using the 'OPTION retry_count=XXX' clause. If a per-query option is provided, it will take precedence over the value specified in the config.
Note that if you use agent mirrors in the definition of your distributed table, the server will select a different mirror before each connection attempt, according to the specified ha_strategy. In this case, the agent_retry_count will be aggregated across all mirrors in the set.
For example, if you have 10 mirrors and set agent_retry_count=5, the server will attempt up to 50 retries (assuming an average of 5 tries per each of the 10 mirrors). With the option ha_strategy = roundrobin, it will be exactly 5 tries per mirror.
At the same time, the value provided as the retry_count option in the agent definition serves as an absolute limit. In other words, the [retry_count=2] option in the agent definition means there will be a maximum of 2 tries, regardless of whether there is 1 or 10 mirrors in the line.
The agent_retry_delay is an integer value that determines the amount of time, in milliseconds, that Manticore Search will wait before retrying to query a remote agent in case of a failure. This value can be specified either globally in the searchd configuration or on a per-query basis using the OPTION retry_delay=XXX clause. If both options are provided, the per-query option will take precedence over the global one. The default value is 500 milliseconds (0.5 seconds). This option is only relevant if agent_retry_count or the per-query OPTION retry_count are non-zero.
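For example, both values can be overridden for an individual query against a distributed table (the table name dist is illustrative):
SELECT * FROM dist WHERE MATCH('test') OPTION retry_count=3, retry_delay=1000;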
The client_timeout option sets the maximum waiting time between requests when using persistent connections. This value is expressed in seconds or with a time suffix. The default value is 5 minutes.
Example:
client_timeout = 1h
The hostname_lookup option defines the strategy for renewing hostnames. By default, the IP addresses of agent host names are cached at server start to avoid excessive access to DNS. However, in some cases, the IP can change dynamically (e.g. cloud hosting) and it may be desirable to not cache the IPs. Setting this option to request disables the caching and queries the DNS for each query. The IP addresses can also be manually renewed using the FLUSH HOSTNAMES command.
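For example, to disable caching you could set the following in the searchd section and then renew the cached IPs manually whenever needed (a sketch):
searchd {
hostname_lookup = request
}
FLUSH HOSTNAMES;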
The listen_tfo option allows for the use of the TCP_FASTOPEN flag for all listeners. By default, it is managed by the system, but it can be explicitly turned off by setting it to '0'.
For more information about the TCP Fast Open extension, please refer to Wikipedia. In short, it allows eliminating one TCP round-trip when establishing a connection.
In practice, using TFO can optimize the client-agent network efficiency, similar to when agent_persistent is in use, but without holding active connections and without limitations on the maximum number of connections.
Most modern operating systems support TFO. Linux (as one of the most progressive) has supported it since 2011, with kernels starting from 3.7 (for the server side). Windows has supported it since some builds of Windows 10. Other systems, such as FreeBSD and macOS, also support it.
For Linux systems, the server checks the variable /proc/sys/net/ipv4/tcp_fastopen and behaves accordingly. Bit 0 manages the client side, while bit 1 rules the listeners. By default, the system has this parameter set to 1, i.e., clients are enabled and listeners are disabled.
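To explicitly disable TCP Fast Open for all listeners, set the option in the searchd section:
searchd {
listen_tfo = 0
}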
persistent_connections_limit = 29 # assume that each host of agents has max_connections = 30 (or 29).
The persistent_connections_limit option defines the maximum number of simultaneous persistent connections to remote persistent agents. This is an instance-wide setting and must be defined in the searchd configuration section. Each time a connection to an agent defined under agent_persistent is made, we attempt to reuse an existing connection (if one exists) or create a new connection and save it for future use. However, in some cases it may be necessary to limit the number of persistent connections. This directive defines the limit and affects the number of connections to each agent's host across all distributed tables.
It is recommended to set this value equal to or less than the max_connections option in the agent's configuration.
A special case of a distributed table is a single local and multiple remotes, which is used exclusively for distributed snippets creation, when snippets are sourced from files. In this case, the local table may act as a "template" table, providing settings for tokenization when building snippets.
snippets_file_prefix = /mnt/common/server1/
The snippets_file_prefix is an optional prefix that can be added to the local file names when generating snippets. The default value is the current working folder.
To learn more about distributed snippets creation, see CALL SNIPPETS.
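A minimal sketch of such a setup: a distributed table with one local "template" table and remote agents, and a CALL SNIPPETS query that loads the source text from files (the table, host, and file names are illustrative):
table dist_snippets {
type = distributed
local = tmpl
agent = server1:9312:docs1
agent = server2:9312:docs2
}
CALL SNIPPETS(('doc1.txt', 'doc2.txt'), 'dist_snippets', 'search keywords', 1 AS load_files);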
You can create a distributed table from multiple percolate tables. The syntax for constructing this type of table is the same as for other distributed tables and can include multiple local tables as well as agents.
For DPQ, the operations of listing stored queries and searching through them (using CALL PQ) are transparent and work as if all the tables were one single local table. However, data manipulation statements such as insert, replace, truncate are not available.
If you include a non-percolate table in the list of agents, the behavior will be undefined. If such an incorrect agent has the same schema as the outer schema of the PQ table (id, query, tags, filters), it will not trigger an error when listing stored PQ rules and may pollute the list of actual PQ rules stored in PQ tables with its own non-PQ strings, so be aware of the confusion this may cause. A CALL PQ to such an incorrect agent will trigger an error.
For more information, see making queries to a distributed percolate table.
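A minimal sketch of a distributed percolate table and a search across it (table and host names are illustrative):
table pq_dist {
type = distributed
local = pq_shard1
agent = host2:9312:pq_shard2
}
CALL PQ('pq_dist', 'Beautiful shoes and accessories');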
Manticore Search has a single level of hierarchy for tables.
Unlike other DBMS, there is no concept of grouping tables into databases in Manticore. However, for interoperability with SQL dialects, Manticore accepts the SHOW DATABASES statement, but it does not return any results.
General syntax:
SHOW TABLES [ LIKE pattern ]
The SHOW TABLES statement lists all currently active tables along with their types. The existing table types are local, distributed, rt, percolate, and template.
SHOW TABLES;
+----------+-------------+
| Index | Type |
+----------+-------------+
| dist | distributed |
| plain | local |
| pq | percolate |
| rt | rt |
| template | template |
+----------+-------------+
5 rows in set (0.00 sec)
$client->nodes()->table();
Array
(
[dist1] => distributed
[rt] => rt
[products] => rt
)
utilsApi.sql('SHOW TABLES')
{u'columns': [{u'Index': {u'type': u'string'}},
{u'Type': {u'type': u'string'}}],
u'data': [{u'Index': u'dist1', u'Type': u'distributed'},
{u'Index': u'rt', u'Type': u'rt'},
{u'Index': u'products', u'Type': u'rt'}],
u'error': u'',
u'total': 0,
u'warning': u''}
res = await utilsApi.sql('SHOW TABLES');
{"columns":[{"Index":{"type":"string"}},{"Type":{"type":"string"}}],"data":[{"Index":"products","Type":"rt"}],"total":0,"error":"","warning":""}
utilsApi.sql("SHOW TABLES")
{columns=[{Index={type=string}}, {Type={type=string}}], data=[{Index=products, Type=rt}], total=0, error=, warning=}
utilsApi.Sql("SHOW TABLES")
{columns=[{Index={type=string}}, {Type={type=string}}], data=[{Index=products, Type=rt}], total=0, error="", warning=""}
Optional LIKE clause is supported for filtering tables by name.
SHOW TABLES LIKE 'pro%';
+----------+-------------+
| Index | Type |
+----------+-------------+
| products | distributed |
+----------+-------------+
1 row in set (0.00 sec)
$client->nodes()->table(['body'=>['pattern'=>'pro%']]);
Array
(
[products] => distributed
)
res = await utilsApi.sql('SHOW TABLES LIKE \'pro%\'');
{u'columns': [{u'Index': {u'type': u'string'}},
{u'Type': {u'type': u'string'}}],
u'data': [{u'Index': u'products', u'Type': u'rt'}],
u'error': u'',
u'total': 0,
u'warning': u''}
utilsApi.sql('SHOW TABLES LIKE \'pro%\'')
{"columns":[{"Index":{"type":"string"}},{"Type":{"type":"string"}}],"data":[{"Index":"products","Type":"rt"}],"total":0,"error":"","warning":""}
utilsApi.sql("SHOW TABLES LIKE 'pro%'")
{columns=[{Index={type=string}}, {Type={type=string}}], data=[{Index=products, Type=rt}], total=0, error=, warning=}
utilsApi.Sql("SHOW TABLES LIKE 'pro%'")
{columns=[{Index={type=string}}, {Type={type=string}}], data=[{Index=products, Type=rt}], total=0, error="", warning=""}
{DESC | DESCRIBE} table [ LIKE pattern ]
The DESCRIBE statement lists the table columns and their associated types. The columns are document ID, full-text fields, and attributes. The order matches the order in which fields and attributes are expected by INSERT and REPLACE statements. Column types include field, integer, timestamp, ordinal, bool, float, bigint, string, and mva. The ID column is typed as bigint. Example:
mysql> DESC rt;
+---------+---------+
| Field | Type |
+---------+---------+
| id | bigint |
| title | field |
| content | field |
| gid | integer |
+---------+---------+
4 rows in set (0.00 sec)
An optional LIKE clause is supported. Refer to SHOW META for its syntax details.
You can also view the table schema by executing the query select * from <table_name>.@table. The benefit of this method is that you can use the WHERE clause for filtering:
select * from tbl.@table where type='text';
+------+-------+------+----------------+
| id | field | type | properties |
+------+-------+------+----------------+
| 2 | title | text | indexed stored |
+------+-------+------+----------------+
1 row in set (0.00 sec)
You can also perform many other actions on <your_table_name>.@table, treating it as a regular Manticore table whose columns are integer and string attributes.
select field from tbl.@table;
select field, properties from tbl.@table where type in ('text', 'uint');
select * from tbl.@table where properties any ('stored');
SHOW CREATE TABLE name
Prints the CREATE TABLE statement used to create the specified table.
SHOW CREATE TABLE tbl\G
Table: tbl
Create Table: CREATE TABLE tbl (
f text indexed stored
) charset_table='non_cjk,cjk' morphology='icu_chinese'
1 row in set (0.00 sec)
If you use the DESC statement on a percolate table, it will display the outer table schema, which is the schema of stored queries. This schema is static and the same for all local percolate tables:
mysql> DESC pq;
+---------+--------+
| Field | Type |
+---------+--------+
| id | bigint |
| query | string |
| tags | string |
| filters | string |
+---------+--------+
4 rows in set (0.00 sec)
If you want to view the expected document schema, use the following command:
DESC <pq table name> table:
mysql> DESC pq TABLE;
+-------+--------+
| Field | Type |
+-------+--------+
| id | bigint |
| title | text |
| gid | uint |
+-------+--------+
3 rows in set (0.00 sec)
Also desc pq table like ... is supported and works as follows:
mysql> desc pq table like '%title%';
+-------+------+----------------+
| Field | Type | Properties |
+-------+------+----------------+
| title | text | indexed stored |
+-------+------+----------------+
1 row in set (0.00 sec)
Deleting a table is performed in 2 steps internally:
1. Table is cleared (similar to TRUNCATE)
2. All table files are removed from the table folder. All the external files that were used by the table (such as wordforms, exceptions, or stopwords) are also deleted. Note that these external files are copied to the table folder when CREATE TABLE is used, so the original files specified in CREATE TABLE will not be deleted.
Deleting a table is possible only when the server is running in the RT mode. It is possible to delete RT tables, PQ tables and distributed tables.
DROP TABLE products;
Query OK, 0 rows affected (0.02 sec)
POST /cli -d "DROP TABLE products"
{
"total":0,
"error":"",
"warning":""
}
$params = [ 'index' => 'products' ];
$response = $client->indices()->drop($params);
Array
(
[total] => 0
[error] =>
[warning] =>
)
utilsApi.sql('DROP TABLE products')
{u'error': u'', u'total': 0, u'warning': u''}
res = await utilsApi.sql('DROP TABLE products');
{"total":0,"error":"","warning":""}
sqlresult = utilsApi.sql("DROP TABLE products");
{total=0, error=, warning=}
sqlresult = utilsApi.Sql("DROP TABLE products");
{total=0, error="", warning=""}
Here is the syntax of the DROP TABLE statement in SQL:
DROP TABLE [IF EXISTS] index_name
When deleting a table via SQL, you can add IF EXISTS to delete the table only if it exists. If you try to delete a non-existing table with the IF EXISTS option, nothing happens.
When deleting a table via PHP, you can add an optional silent parameter which works the same as IF EXISTS.
DROP TABLE IF EXISTS products;
POST /cli -d "DROP TABLE IF EXISTS products"
$params =
[
'index' => 'products',
'body' => ['silent' => true]
];
$client->indices()->drop($params);
utilsApi.sql('DROP TABLE IF EXISTS products')
{u'error': u'', u'total': 0, u'warning': u''}
res = await utilsApi.sql('DROP TABLE IF EXISTS products');
{"total":0,"error":"","warning":""}
sqlresult = utilsApi.sql("DROP TABLE IF EXISTS products");
{total=0, error=, warning=}
sqlresult = utilsApi.Sql("DROP TABLE IF EXISTS products");
{total=0, error="", warning=""}
The table can be emptied with a TRUNCATE TABLE SQL statement or with a truncate() PHP client function.
Here is the syntax for the SQL statement:
TRUNCATE TABLE index_name [WITH RECONFIGURE]
When this statement is executed, it clears the RT table completely. It disposes of the in-memory data, unlinks all the table data files, and releases the associated binary logs.
A table can also be emptied with DELETE FROM index WHERE id>0, but it's not recommended as it's slower than TRUNCATE.
TRUNCATE TABLE products;
Query OK, 0 rows affected (0.02 sec)
POST /cli -d "TRUNCATE TABLE products"
{
"total":0,
"error":"",
"warning":""
}
$params = [ 'index' => 'products' ];
$response = $client->indices()->truncate($params);
Array(
[total] => 0
[error] =>
[warning] =>
)
utilsApi.sql('TRUNCATE TABLE products')
{u'error': u'', u'total': 0, u'warning': u''}
res = await utilsApi.sql('TRUNCATE TABLE products');
{"total":0,"error":"","warning":""}
utilsApi.sql("TRUNCATE TABLE products");
{total=0, error=, warning=}
utilsApi.Sql("TRUNCATE TABLE products");
{total=0, error="", warning=""}
One of the possible uses of this command is before attaching a table.
When the RECONFIGURE option is used, new tokenization, morphology, and other text processing settings specified in the config take effect after the table is cleared. If the schema declaration in the config differs from the table schema, the new schema from the config is applied after the table is cleared.
With this option, clearing and reconfiguring a table becomes one atomic operation.
TRUNCATE TABLE products with reconfigure;
Query OK, 0 rows affected (0.02 sec)
POST /cli -d "TRUNCATE TABLE products with reconfigure"
{
"total":0,
"error":"",
"warning":""
}
$params = [ 'index' => 'products', 'with' => 'reconfigure' ];
$response = $client->indices()->truncate($params);
Array(
[total] => 0
[error] =>
[warning] =>
)
utilsApi.sql('TRUNCATE TABLE products WITH RECONFIGURE')
{u'error': u'', u'total': 0, u'warning': u''}
res = await utilsApi.sql('TRUNCATE TABLE products WITH RECONFIGURE');
{"total":0,"error":"","warning":""}
utilsApi.sql("TRUNCATE TABLE products WITH RECONFIGURE");
{total=0, error=, warning=}
utilsApi.Sql("TRUNCATE TABLE products WITH RECONFIGURE");
{total=0, error="", warning=""}
Manticore Search is a highly distributed system that provides all the necessary components to create a highly available and scalable database for search, including distributed tables, mirroring, load balancing, and replication.
Manticore Search offers great flexibility in terms of how you set up your cluster. There are no limitations, so it's up to you to design your cluster according to your needs. Simply learn about the tools mentioned above and use them to achieve your desired goal.
To add a new node to a cluster, simply start another instance of Manticore and ensure that it is accessible by the other nodes in the cluster. Connect the new node to the rest of the cluster using a distributed table and ensure data safety with replication.
To understand how to add a distributed table with remote agents, it is important to first have a basic understanding of distributed tables. In this article, we will focus on how to use a distributed table as the basis for creating a cluster of Manticore instances.
Here is an example of how to split data over 4 servers, each serving one of the shards:
table mydist {
type = distributed
agent = box1:9312:shard1
agent = box2:9312:shard2
agent = box3:9312:shard3
agent = box4:9312:shard4
}
In the event of a server failure, the distributed table will still work, but the results from the failed shard will be missing.
After adding mirrors (as shown in the example below), each shard is found on 2 servers. By default, the master (the searchd instance with the distributed table) will randomly pick one of the mirrors.
The mode used for picking mirrors can be set using the ha_strategy setting. In addition to the default random mode there's also ha_strategy = roundrobin.
More advanced strategies based on latency-weighted probabilities include noerrors and nodeads. These not only take out mirrors with issues but also monitor response times and do balancing. If a mirror responds slower (for example, due to some operations running on it), it will receive fewer requests. When the mirror recovers and provides better times, it will receive more requests.
table mydist {
type = distributed
agent = box1:9312|box5:9312:shard1
agent = box2:9312|box6:9312:shard2
agent = box3:9312|box7:9312:shard3
agent = box4:9312|box8:9312:shard4
}
Agent mirrors can be used interchangeably when processing a search query. The Manticore instance(s) hosting the distributed table where the mirrored agents are defined keeps track of mirror status (alive or dead) and response times, and performs automatic failover and load balancing based on this information.
agent = node1|node2|node3:9312:shard2
The above example declares that node1:9312, node2:9312, and node3:9312 all have a table called shard2, and can be used as interchangeable mirrors. If any of these servers go down, the queries will be distributed between the remaining two. When the server comes back online, the master will detect it and begin routing queries to all three nodes again.
A mirror may also include an individual table list, as follows:
agent = node1:9312:node1shard2|node2:9312:node2shard2
This works similarly to the previous example, but different table names will be used when querying different servers. For example, node1shard2 will be used when querying node1:9312, and node2shard2 will be used when querying node2:9312.
By default, all queries are routed to the best of the mirrors. The best mirror is selected based on recent statistics, as controlled by the ha_period_karma config directive. The master stores metrics (total query count, error count, response time, etc.) for each agent and groups these by time spans. The karma is the length of the time span. The best agent mirror is then determined dynamically based on the last two such time spans. The specific algorithm used to pick a mirror can be configured with the ha_strategy directive.
The karma period is in seconds and defaults to 60 seconds. The master stores up to 15 karma spans with per-agent statistics for instrumentation purposes (see SHOW AGENT STATUS statement). However, only the last two spans out of these are used for HA/LB logic.
When there are no queries, the master sends a regular ping command every ha_ping_interval milliseconds in order to collect statistics and check if the remote host is still alive. The ha_ping_interval defaults to 1000 msec. Setting it to 0 disables pings, and statistics will only be accumulated based on actual queries.
Example:
# sharding table over 4 servers total
# in just 2 shards but with 2 failover mirrors for each shard
# node1, node2 carry shard1 as local
# node3, node4 carry shard2 as local
# config on node1, node2
agent = node3:9312|node4:9312:shard2
# config on node3, node4
agent = node1:9312|node2:9312:shard1
Load balancing is turned on by default for any distributed table that uses mirroring. By default, queries are distributed randomly among the mirrors. You can change this behavior by using the ha_strategy setting.
ha_strategy = {random|nodeads|noerrors|roundrobin}
The mirror selection strategy for load balancing is optional and is set to random by default.
The strategy used for mirror selection, or in other words, choosing a specific agent mirror in a distributed table, is controlled by this directive. Essentially, it controls how the master performs load balancing between the configured mirror agent nodes. The following strategies are implemented:
The default balancing mode is simple linear random distribution among the mirrors. This means that equal selection probabilities are assigned to each mirror. This is similar to round-robin (RR), but does not impose a strict selection order.
ha_strategy = random
The default simple random strategy does not take into account the status of mirrors, error rates, and most importantly, actual response latencies. To address heterogeneous clusters and temporary spikes in agent node load, there are a group of balancing strategies that dynamically adjust the probabilities based on the actual query latencies observed by the master.
The adaptive strategies based on latency-weighted probabilities work as follows:
Initially, the probabilities are equal. On every step, they are scaled by the inverse of the latencies observed during the last karma period, and then renormalized. For example, if during the first 60 seconds after the master startup, 4 mirrors had latencies of 10 ms, 5 ms, 30 ms, and 3 ms respectively, the first adjustment step would go as follows:
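For the numbers above: the inverse latencies are 1/10, 1/5, 1/30, and 1/3 (i.e. 0.1, 0.2, 0.033, and 0.333); their sum is about 0.667; dividing each inverse by that sum gives roughly 0.15, 0.30, 0.05, and 0.50.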
This means that the first mirror would have a 15% chance of being chosen during the next karma period, the second one a 30% chance, the third one (slowest at 30 ms) only a 5% chance, and the fourth and fastest one (at 3 ms) a 50% chance. After that period, the second adjustment step would update those chances again, and so on.
The idea is that once the observed latencies stabilize, the latency weighted probabilities will stabilize as well. All these adjustment iterations are meant to converge at a point where the average latencies are roughly equal across all mirrors.
Latency-weighted probabilities, but dead mirrors are excluded from the selection. A "dead" mirror is defined as a mirror that has resulted in multiple hard errors (e.g. network failure, or no answer, etc) in a row.
ha_strategy = nodeads
Latency-weighted probabilities, but mirrors with a worse error/success ratio are excluded from selection.
ha_strategy = noerrors
Simple round-robin selection, that is, selecting the first mirror in the list, then the second one, then the third one, etc, and then repeating the process once the last mirror in the list is reached. Unlike with the randomized strategies, RR imposes a strict querying order (1, 2, 3, ..., N-1, N, 1, 2, 3, ..., and so on) and guarantees that no two consecutive queries will be sent to the same mirror.
ha_strategy = roundrobin
ha_period_karma = 2m
ha_period_karma defines the size of the agent mirror statistics window, in seconds (or a time suffix). Optional, the default is 60.
For a distributed table with agent mirrors, the server tracks several different per-mirror counters. These counters are then used for failover and balancing. (The server picks the best mirror to use based on the counters.) Counters are accumulated in blocks of ha_period_karma seconds.
After beginning a new block, the master may still use the accumulated values from the previous one until the new one is half full. Thus, any previous history stops affecting the mirror choice after at most 1.5 times ha_period_karma seconds.
Although at most 2 blocks are used for mirror selection, up to 15 last blocks are actually stored for instrumentation purposes. They can be inspected using the SHOW AGENT STATUS statement.
ha_ping_interval = 3s
ha_ping_interval directive defines the interval between pings sent to the agent mirrors, in milliseconds (or with a time suffix). This option is optional and its default value is 1000.
For a distributed table with agent mirrors, the server sends all mirrors a ping command during idle periods to track their current status (whether they are alive or dead, network roundtrip time, etc.). The interval between pings is determined by the ha_ping_interval setting.
If you want to disable pings, set ha_ping_interval to 0.
With Manticore, write transactions (such as INSERT, REPLACE, DELETE, TRUNCATE, UPDATE, COMMIT) can be replicated to other cluster nodes before the transaction is fully applied on the current node. Currently, replication is supported for percolate, rt, and distributed tables on Linux and macOS. However, Manticore Search packages for Windows do not provide replication support.
Manticore's replication is powered by the Galera library and boasts several impressive features:
To set up replication in Manticore Search:
Each node must have a unique server_id. If there is no replication listen directive set, Manticore will use the first two free ports in the range of 200 ports after the default protocol listening port for each created cluster. To set replication ports manually, the listen directive (of replication type) port range must be defined, and the address/port range pairs must not intersect between different nodes on the same server. As a rule of thumb, the port range should specify at least two ports per cluster.
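A sketch of a node configured for replication might look like this (addresses, ports, and paths are illustrative):
searchd {
listen = 192.168.1.101:9312
listen = 192.168.1.101:9320-9328:replication
data_dir = /var/lib/manticore
server_id = 1
}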
A replication cluster is a group of nodes in which a write transaction is replicated. Replication is set up on a per-table basis, meaning that one table can only belong to one cluster. There is no limit on the number of tables that a cluster can have. All transactions such as INSERT, REPLACE, DELETE, TRUNCATE on any percolate or real-time table that belongs to a cluster are replicated to all the other nodes in that cluster. Distributed tables can also be part of the replication process. Replication is multi-master, so writes to any node or multiple nodes simultaneously will work just as well.
To create a cluster, you typically use CREATE CLUSTER <cluster name>, and to join an existing cluster, you use JOIN CLUSTER <cluster name> AT 'host:port'. However, in some rare cases, you may want to fine-tune the behavior of CREATE/JOIN CLUSTER. The available options are:
This option specifies the name of the cluster. It should be unique among all the clusters in the system.
Note: The maximum allowable hostname length for the JOIN command is 253 characters. If you exceed this limit, searchd will generate an error.
The path option specifies the data directory for write-set cache replication and incoming tables from other nodes. This value should be unique among all the clusters in the system and should be specified as a path relative to the data_dir directory. By default, it is set to the value of data_dir.
The nodes option is a list of address:port pairs for all the nodes in the cluster, separated by commas. This list should be obtained using the node's API interface and can include the address of the current node as well. It is used to join the node to the cluster and to rejoin it after a restart.
The options option allows you to pass additional options directly to the Galera replication plugin, as described in the Galera Documentation Parameters.
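For example, a typical flow is to run CREATE CLUSTER on the first node, JOIN CLUSTER on each additional node, and then add tables to the cluster from any node that is already a member (the cluster name, address, and table name below are illustrative):
CREATE CLUSTER posts;
JOIN CLUSTER posts AT '192.168.1.101:9312';
ALTER CLUSTER posts ADD weekly_index;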
When working with a replication cluster, all write statements such as INSERT, REPLACE, DELETE, TRUNCATE, and UPDATE that modify the content of a cluster's table must use the cluster_name:index_name expression instead of the table name. This ensures that the changes are propagated to all replicas in the cluster. If the correct expression is not used, an error will be triggered.
In the JSON interface, the cluster property must be set along with the table name for all write statements to a cluster's table. Failure to set the cluster property will result in an error.
The Auto ID for a table in a cluster should be valid as long as the server_id is correctly configured.
INSERT INTO posts:weekly_index VALUES ( 'iphone case' )
TRUNCATE RTINDEX click_query:weekly_index
UPDATE posts:rt_tags SET tags=(101, 302, 304) WHERE MATCH ('use') AND id IN (1,101,201)
DELETE FROM clicks:rt WHERE MATCH ('dumy') AND gid>206
POST /insert -d '
{
"cluster":"posts",
"index":"weekly_index",
"doc":
{
"title" : "iphone case",
"price" : 19.85
}
}'
POST /delete -d '
{
"cluster":"posts",
"index": "weekly_index",
"id":1
}'
$index->addDocuments([
1, ['title' => 'iphone case', 'price' => 19.85]
]);
$index->deleteDocument(1);
indexApi.insert({"cluster":"posts","index":"weekly_index","doc":{"title":"iphone case","price":19.85}})
indexApi.delete({"cluster":"posts","index":"weekly_index","id":1})
res = await indexApi.insert({"cluster":"posts","index":"weekly_index","doc":{"title":"iphone case","price":19.85}});
res = await indexApi.delete({"cluster":"posts","index":"weekly_index","id":1});
InsertDocumentRequest newdoc = new InsertDocumentRequest();
HashMap<String,Object> doc = new HashMap<String,Object>(){{
put("title","Crossbody Bag with Tassel");
put("price",19.85);
}};
newdoc.index("weekly_index").cluster("posts").id(1L).setDoc(doc);
sqlresult = indexApi.insert(newdoc);
DeleteDocumentRequest deleteRequest = new DeleteDocumentRequest();
deleteRequest.index("weekly_index").cluster("posts").setId(1L);
indexApi.delete(deleteRequest);
Dictionary<string, Object> doc = new Dictionary<string, Object>();
doc.Add("title", "Crossbody Bag with Tassel");
doc.Add("price", 19.85);
InsertDocumentRequest newdoc = new InsertDocumentRequest(index: "weekly_index", cluster: "posts", id: 1, doc: doc);
var sqlresult = indexApi.Insert(newdoc);
DeleteDocumentRequest deleteDocumentRequest = new DeleteDocumentRequest(index: "weekly_index", cluster: "posts", id: 1);
indexApi.Delete(deleteDocumentRequest);
Read statements such as SELECT, CALL PQ, DESCRIBE can either use regular table names that are not prepended with a cluster name, or they can use the cluster_name:index_name format. If the latter is used, the cluster_name component is ignored.
When using the HTTP endpoint json/search, the cluster property can be specified if desired, but it can also be omitted.
SELECT * FROM weekly_index
CALL PQ('posts:weekly_index', 'document is here')
POST /search -d '
{
"cluster":"posts",
"index":"weekly_index",
"query":{"match":{"title":"keyword"}}
}'
POST /search -d '
{
"index":"weekly_index",
"query":{"match":{"title":"keyword"}}
}'
Replication plugin options can be adjusted using the SET statement.
A list of available options can be found in the Galera Documentation Parameters.
SET CLUSTER click_query GLOBAL 'pc.bootstrap' = 1
POST /cli -d "
SET CLUSTER click_query GLOBAL 'pc.bootstrap' = 1
"
It's possible for replicated nodes to diverge from one another, leading to a state where all nodes are labeled as non-primary. This can occur as a result of a network split between nodes, a cluster crash, or if the replication plugin experiences an exception when determining the primary component. In such a scenario, it's necessary to select a node and promote it to the role of primary component.
To identify the node that needs to be promoted, you should compare the last_committed cluster status variable value on all nodes. If all the servers are currently running, there's no need to restart the cluster. Instead, you can simply promote the node with the highest last_committed value to the primary component using the SET statement (as demonstrated in the example).
The other nodes will then reconnect to the primary component and resynchronize their data based on this node.
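To compare the values, you can check the node status on each node; a minimal sketch, assuming the counter follows the cluster_<name>_variable_name naming used for the other cluster status variables:

SHOW STATUS LIKE 'cluster_posts_last_committed'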
SET CLUSTER posts GLOBAL 'pc.bootstrap' = 1
POST /cli -d "
SET CLUSTER posts GLOBAL 'pc.bootstrap' = 1
"
To use replication, you need to define one listen port for the SphinxAPI protocol and one listen directive for the replication address and port range in the configuration file. Also, specify the data_dir folder for receiving incoming tables.
searchd {
listen = 9312
listen = 192.168.1.101:9360-9370:replication
data_dir = /var/lib/manticore/
...
}
To replicate tables, you must create a cluster on the server that has the local tables to be replicated.
CREATE CLUSTER posts
POST /cli -d "
CREATE CLUSTER posts
"
$params = [
'cluster' => 'posts'
];
$response = $client->cluster()->create($params);
utilsApi.sql('CREATE CLUSTER posts')
res = await utilsApi.sql('CREATE CLUSTER posts');
utilsApi.sql("CREATE CLUSTER posts");
utilsApi.Sql("CREATE CLUSTER posts");
Add these local tables to the cluster
ALTER CLUSTER posts ADD pq_title
ALTER CLUSTER posts ADD pq_clicks
POST /cli -d "
ALTER CLUSTER posts ADD pq_title
"
POST /cli -d "
ALTER CLUSTER posts ADD pq_clicks
"
$params = [
'cluster' => 'posts',
'body' => [
'operation' => 'add',
'index' => 'pq_title'
]
];
$response = $client->cluster()->alter($params);
$params = [
'cluster' => 'posts',
'body' => [
'operation' => 'add',
'index' => 'pq_clicks'
]
];
$response = $client->cluster()->alter($params);
utilsApi.sql('ALTER CLUSTER posts ADD pq_title')
utilsApi.sql('ALTER CLUSTER posts ADD pq_clicks')
res = await utilsApi.sql('ALTER CLUSTER posts ADD pq_title');
res = await utilsApi.sql('ALTER CLUSTER posts ADD pq_clicks');
utilsApi.sql("ALTER CLUSTER posts ADD pq_title");
utilsApi.sql("ALTER CLUSTER posts ADD pq_clicks");
utilsApi.Sql("ALTER CLUSTER posts ADD pq_title");
utilsApi.Sql("ALTER CLUSTER posts ADD pq_clicks");
All other nodes that wish to receive a replica of the cluster's tables should join the cluster as follows:
JOIN CLUSTER posts AT '192.168.1.101:9312'
POST /cli -d "
JOIN CLUSTER posts AT '192.168.1.101:9312'
"
$params = [
'cluster' => 'posts',
'body' => [
'192.168.1.101:9312'
]
];
$response = $client->cluster()->join($params);
utilsApi.sql('JOIN CLUSTER posts AT \'192.168.1.101:9312\'')
res = await utilsApi.sql('JOIN CLUSTER posts AT \'192.168.1.101:9312\'');
utilsApi.sql("JOIN CLUSTER posts AT '192.168.1.101:9312'");
utilsApi.Sql("JOIN CLUSTER posts AT '192.168.1.101:9312'");
When running queries, prepend the table name with the cluster name (posts:) or use the cluster property of the HTTP request object.
INSERT INTO posts:pq_title VALUES ( 3, 'test me' )
POST /insert -d '
{
"cluster":"posts",
"index":"pq_title",
"id": 3
"doc":
{
"title" : "test me"
}
}'
$index->addDocuments([
3, ['title' => 'test me']
]);
indexApi.insert({"cluster":"posts","index":"pq_title","id":3"doc":{"title":"test me"}})
res = await indexApi.insert({"cluster":"posts","index":"pq_title","id":3"doc":{"title":"test me"}});
InsertDocumentRequest newdoc = new InsertDocumentRequest();
HashMap<String,Object> doc = new HashMap<String,Object>(){{
put("title","test me");
}};
newdoc.index("pq_title").cluster("posts").id(3L).setDoc(doc);
sqlresult = indexApi.insert(newdoc);
Dictionary<string, Object> doc = new Dictionary<string, Object>();
doc.Add("title", "test me");
InsertDocumentRequest newdoc = new InsertDocumentRequest(index: "pq_title", cluster: "posts", id: 3, doc: doc);
var sqlresult = indexApi.Insert(newdoc);
All queries that modify tables in the cluster are now replicated to all nodes in the cluster.
To create a replication cluster, you must set its name at a minimum.
If you are creating a single cluster or the first cluster, you may omit the path option. In this case, the data_dir option will be used as the cluster path. However, for all subsequent clusters, you must specify the path and the path must be available. The nodes option may also be set to list all nodes in the cluster.
CREATE CLUSTER posts
CREATE CLUSTER click_query '/var/data/click_query/' as path
CREATE CLUSTER click_query '/var/data/click_query/' as path, 'clicks_mirror1:9312,clicks_mirror2:9312,clicks_mirror3:9312' as nodes
POST /cli -d "
CREATE CLUSTER posts
"
POST /cli -d "
CREATE CLUSTER click_query '/var/data/click_query/' as path
"
POST /cli -d "
CREATE CLUSTER click_query '/var/data/click_query/' as path, 'clicks_mirror1:9312,clicks_mirror2:9312,clicks_mirror3:9312' as nodes
"
$params = [
'cluster' => 'posts',
];
$response = $client->cluster()->create($params);
$params = [
'cluster' => 'click_query',
'body' => [
'path' => '/var/data/click_query/'
]
];
$response = $client->cluster()->create($params);
$params = [
'cluster' => 'click_query',
'body' => [
'path' => '/var/data/click_query/',
'nodes' => 'clicks_mirror1:9312,clicks_mirror2:9312,clicks_mirror3:9312'
]
];
$response = $client->cluster()->create($params);
utilsApi.sql('CREATE CLUSTER posts')
utilsApi.sql('CREATE CLUSTER click_query \'/var/data/click_query/\' as path')
utilsApi.sql('CREATE CLUSTER click_query \'/var/data/click_query/\' as path, \'clicks_mirror1:9312,clicks_mirror2:9312,clicks_mirror3:9312\' as nodes')
res = await utilsApi.sql('CREATE CLUSTER posts');
res = await utilsApi.sql('CREATE CLUSTER click_query \'/var/data/click_query/\' as path');
res = await utilsApi.sql('CREATE CLUSTER click_query \'/var/data/click_query/\' as path, \'clicks_mirror1:9312,clicks_mirror2:9312,clicks_mirror3:9312\' as nodes');
utilsApi.sql("CREATE CLUSTER posts");
utilsApi.sql("CREATE CLUSTER click_query '/var/data/click_query/' as path");
utilsApi.sql("CREATE CLUSTER click_query '/var/data/click_query/' as path, 'clicks_mirror1:9312,clicks_mirror2:9312,clicks_mirror3:9312' as nodes");
utilsApi.Sql("CREATE CLUSTER posts");
utilsApi.Sql("CREATE CLUSTER click_query '/var/data/click_query/' as path");
utilsApi.Sql("CREATE CLUSTER click_query '/var/data/click_query/' as path, 'clicks_mirror1:9312,clicks_mirror2:9312,clicks_mirror3:9312' as nodes");
If the nodes option is not specified when creating a cluster, the first node that joins the cluster will be saved as the nodes option.
To join an existing cluster, you must specify at least:
host:port of another node in the cluster you are joining
JOIN CLUSTER posts AT '10.12.1.35:9312'
POST /cli -d "
JOIN CLUSTER posts AT '10.12.1.35:9312'
"
$params = [
'cluster' => 'posts',
'body' => [
'10.12.1.35:9312'
]
];
$response = $client->cluster()->join($params);
utilsApi.sql('JOIN CLUSTER posts AT \'10.12.1.35:9312\'')
{u'error': u'', u'total': 0, u'warning': u''}
res = await utilsApi.sql('JOIN CLUSTER posts AT \'10.12.1.35:9312\'');
{"total":0,"error":"","warning":""}
utilsApi.sql("JOIN CLUSTER posts AT '10.12.1.35:9312'");
utilsApi.Sql("JOIN CLUSTER posts AT '10.12.1.35:9312'");
In most cases, the above is sufficient when there is a single replication cluster. However, if you are creating multiple replication clusters, you must also set the path and ensure that the directory is available.
JOIN CLUSTER c2 at '127.0.0.1:10201' 'c2' as path
A node joins a cluster by obtaining data from a specified node and, if successful, updates the node lists across all other cluster nodes in the same way as if it was done manually through ALTER CLUSTER ... UPDATE nodes. This list is used to re-join nodes to the cluster upon restart.
There are two lists of nodes:
1. cluster_<name>_nodes_set: used to re-join nodes to the cluster upon restart. It is updated across all nodes in the same way as ALTER CLUSTER ... UPDATE nodes does. JOIN CLUSTER command performs this update automatically. The Cluster status displays this list as cluster_<name>_nodes_set.
2. cluster_<name>_nodes_view: this list contains all active nodes used for replication and does not require manual management. ALTER CLUSTER ... UPDATE nodes actually copies this list of nodes to the list of nodes used to re-join upon restart. The Cluster status displays this list as cluster_<name>_nodes_view.
When nodes are located in different network segments or data centers, the nodes option may be set explicitly. This minimizes traffic between nodes and utilizes gateway nodes for intercommunication between data centers. The following code joins an existing cluster using the nodes option.
Note: The cluster_<name>_nodes_set list is not updated automatically when this syntax is used. To update it, use ALTER CLUSTER ... UPDATE nodes.
JOIN CLUSTER click_query 'clicks_mirror1:9312;clicks_mirror2:9312;clicks_mirror3:9312' as nodes
POST /cli -d "
JOIN CLUSTER click_query 'clicks_mirror1:9312;clicks_mirror2:9312;clicks_mirror3:9312' as nodes
"
$params = [
'cluster' => 'click_query',
'body' => [
'nodes' => 'clicks_mirror1:9312;clicks_mirror2:9312;clicks_mirror3:9312'
]
];
$response = $client->cluster()->join($params);
utilsApi.sql('JOIN CLUSTER click_query \'clicks_mirror1:9312;clicks_mirror2:9312;clicks_mirror3:9312\' as nodes')
{u'error': u'', u'total': 0, u'warning': u''}
res = await utilsApi.sql('JOIN CLUSTER click_query \'clicks_mirror1:9312;clicks_mirror2:9312;clicks_mirror3:9312\' as nodes');
{"total":0,"error":"","warning":""}
utilsApi.sql("JOIN CLUSTER click_query 'clicks_mirror1:9312;clicks_mirror2:9312;clicks_mirror3:9312' as nodes");
utilsApi.Sql("JOIN CLUSTER click_query 'clicks_mirror1:9312;clicks_mirror2:9312;clicks_mirror3:9312' as nodes");
The JOIN CLUSTER command works synchronously and completes as soon as the node receives all data from the other nodes in the cluster and is in sync with them.
The DELETE CLUSTER statement removes the specified cluster with its name. Once the cluster is deleted, it is removed from all nodes, but its tables remain intact and become active local non-replicated tables.
DELETE CLUSTER click_query
POST /cli -d "DELETE CLUSTER click_query"
$params = [
'cluster' => 'click_query',
'body' => []
];
$response = $client->cluster()->delete($params);
utilsApi.sql('DELETE CLUSTER click_query')
{u'error': u'', u'total': 0, u'warning': u''}
res = await utilsApi.sql('DELETE CLUSTER click_query');
{"total":0,"error":"","warning":""}
utilsApi.sql("DELETE CLUSTER click_query");
utilsApi.Sql("DELETE CLUSTER click_query");
ALTER CLUSTER <cluster_name> ADD <table_name> adds an existing local table to the cluster. The node that receives the ALTER query sends the table to the other nodes in the cluster. All the local tables with the same name on the other nodes of the cluster are replaced with the new table.
Once the table is replicated, write statements can be performed on any node, but the table name must be prefixed with the cluster name, like INSERT INTO <clusterName>:<table_name>.
ALTER CLUSTER click_query ADD clicks_daily_index
POST /cli -d "
ALTER CLUSTER click_query ADD clicks_daily_index
"
$params = [
'cluster' => 'click_query',
'body' => [
'operation' => 'add',
'index' => 'clicks_daily_index'
]
];
$response = $client->cluster()->alter($params);
utilsApi.sql('ALTER CLUSTER click_query ADD clicks_daily_index')
{u'error': u'', u'total': 0, u'warning': u''}
res = await utilsApi.sql('ALTER CLUSTER click_query ADD clicks_daily_index');
{"total":0,"error":"","warning":""}
utilsApi.sql("ALTER CLUSTER click_query ADD clicks_daily_index");
utilsApi.Sql("ALTER CLUSTER click_query ADD clicks_daily_index");
ALTER CLUSTER <cluster_name> DROP <table_name> forgets about a local table, meaning it does not remove the table files on the nodes, but rather just makes it an inactive, non-replicated table.
Once a table is removed from a cluster, it becomes a local table, and write statements must use just the table name, like INSERT INTO <table_name>, without the cluster prefix.
ALTER CLUSTER posts DROP weekly_index
POST /cli -d "
ALTER CLUSTER posts DROP weekly_index
"
$params = [
'cluster' => 'posts',
'body' => [
'operation' => 'drop',
'index' => 'weekly_index'
]
];
$response = $client->cluster()->alter($params);
utilsApi.sql('ALTER CLUSTER posts DROP weekly_index')
{u'error': u'', u'total': 0, u'warning': u''}
res = await utilsApi.sql('ALTER CLUSTER posts DROP weekly_index');
{"total":0,"error":"","warning":""}
utilsApi.sql("ALTER CLUSTER posts DROP weekly_index");
utilsApi.Sql("ALTER CLUSTER posts DROP weekly_index");
The ALTER CLUSTER <cluster_name> UPDATE nodes statement updates the node lists on each node within the specified cluster to include all active nodes in the cluster. For more information on node lists, see Joining a cluster.
ALTER CLUSTER posts UPDATE nodes
POST /cli -d "
ALTER CLUSTER posts UPDATE nodes
"
$params = [
'cluster' => 'posts',
'body' => [
'operation' => 'update',
]
];
$response = $client->cluster()->alter($params);
utilsApi.sql('ALTER CLUSTER posts UPDATE nodes')
{u'error': u'', u'total': 0, u'warning': u''}
res = await utilsApi.sql('ALTER CLUSTER posts UPDATE nodes');
{"total":0,"error":"","warning":""}
utilsApi.sql("ALTER CLUSTER posts UPDATE nodes");
utilsApi.Sql("ALTER CLUSTER posts UPDATE nodes");
For instance, when the cluster was initially established, the list of nodes used to rejoin the cluster was 10.10.0.1:9312,10.10.1.1:9312. Since then, other nodes joined the cluster and now the active nodes are 10.10.0.1:9312,10.10.1.1:9312,10.15.0.1:9312,10.15.0.3:9312. However, the list of nodes used to rejoin the cluster has not been updated.
To rectify this, you can run the ALTER CLUSTER ... UPDATE nodes statement to copy the list of active nodes to the list of nodes used to rejoin the cluster. After this, the list of nodes used to rejoin the cluster will include all the active nodes in the cluster.
Both lists of nodes can be viewed using the Cluster status statement (cluster_post_nodes_set and cluster_post_nodes_view).
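For example, to check both lists on a node at a glance, a minimal sketch (the cluster name post matches the status output shown later in this section; SHOW STATUS LIKE filters counters by pattern):

SHOW STATUS LIKE 'cluster_post_nodes_%'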
To remove a node from the replication cluster, follow these steps:
1. Stop the node
2. Remove the information about the cluster from <data_dir>/manticore.json (usually /var/lib/manticore/manticore.json) on the node that has been stopped.
3. Run ALTER CLUSTER cluster_name UPDATE nodes on any other node.
After these steps, the other nodes will forget about the detached node and the detached node will forget about the cluster. This action will not impact the tables in the cluster or on the detached node.
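A minimal sketch of this sequence on Linux (the systemd service name, the state file path, and the cluster name posts are illustrative assumptions):

# 1. On the node being removed, stop the daemon (assuming a systemd-managed service named "manticore")
systemctl stop manticore
# 2. On the same node, remove the cluster section from the state file
#    (usually /var/lib/manticore/manticore.json, depending on your data_dir)
# 3. On any remaining node, refresh the node lists
mysql -h0 -P9306 -e "ALTER CLUSTER posts UPDATE nodes"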
You can view the cluster status information by checking the node status. This can be done using the Node status command, which displays various information about the node, including the cluster status variables.
The output format for the cluster status variables is as follows: cluster_name_variable_name variable_value. Most of the variables are described in the Galera Documentation Status Variables. In addition to these variables, Manticore Search also displays:
the current state of the node in the cluster: closed, destroyed, joining, donor, synced
the list of nodes used to rejoin the cluster, as defined by the CREATE, JOIN or ALTER ... UPDATE nodes commands (cluster_<name>_nodes_set)
the actual list of active nodes the node currently sees (cluster_<name>_nodes_view)
SHOW STATUS
+----------------------------+-------------------------------------------------------------------------------------+
| Counter | Value |
+----------------------------+-------------------------------------------------------------------------------------+
| cluster_name | post |
| cluster_post_state_uuid | fba97c45-36df-11e9-a84e-eb09d14b8ea7 |
| cluster_post_conf_id | 1 |
| cluster_post_status | primary |
| cluster_post_size | 5 |
| cluster_post_local_index | 0 |
| cluster_post_node_state | synced |
| cluster_post_indexes_count | 2 |
| cluster_post_indexes | pq1,pq_posts |
| cluster_post_nodes_set | 10.10.0.1:9312 |
| cluster_post_nodes_view | 10.10.0.1:9312,10.10.0.1:9320:replication,10.10.1.1:9312,10.10.1.1:9320:replication |
POST /cli -d "
SHOW STATUS
"
"
{"columns":[{"Counter":{"type":"string"}},{"Value":{"type":"string"}}],
"data":[
{"Counter":"cluster_name", "Value":"post"},
{"Counter":"cluster_post_state_uuid", "Value":"fba97c45-36df-11e9-a84e-eb09d14b8ea7"},
{"Counter":"cluster_post_conf_id", "Value":"1"},
{"Counter":"cluster_post_status", "Value":"primary"},
{"Counter":"cluster_post_size", "Value":"5"},
{"Counter":"cluster_post_local_index", "Value":"0"},
{"Counter":"cluster_post_node_state", "Value":"synced"},
{"Counter":"cluster_post_indexes_count", "Value":"2"},
{"Counter":"cluster_post_indexes", "Value":"pq1,pq_posts"},
{"Counter":"cluster_post_nodes_set", "Value":"10.10.0.1:9312"},
{"Counter":"cluster_post_nodes_view", "Value":"10.10.0.1:9312,10.10.0.1:9320:replication,10.10.1.1:9312,10.10.1.1:9320:replication"}
],
"total":0,
"error":"",
"warning":""
}
$params = [
'body' => []
];
$response = $client->nodes()->status($params);
(
"cluster_name" => "post",
"cluster_post_state_uuid" => "fba97c45-36df-11e9-a84e-eb09d14b8ea7",
"cluster_post_conf_id" => 1,
"cluster_post_status" => "primary",
"cluster_post_size" => 5,
"cluster_post_local_index" => 0,
"cluster_post_node_state" => "synced",
"cluster_post_indexes_count" => 2,
"cluster_post_indexes" => "pq1,pq_posts",
"cluster_post_nodes_set" => "10.10.0.1:9312",
"cluster_post_nodes_view" => "10.10.0.1:9312,10.10.0.1:9320:replication,10.10.1.1:9312,10.10.1.1:9320:replication"
)
utilsApi.sql('SHOW STATUS')
{u'columns': [{u'Key': {u'type': u'string'}},
{u'Value': {u'type': u'string'}}],
u'data': [
{u'Key': u'cluster_name', u'Value': u'post'},
{u'Key': u'cluster_post_state_uuid', u'Value': u'fba97c45-36df-11e9-a84e-eb09d14b8ea7'},
{u'Key': u'cluster_post_conf_id', u'Value': u'1'},
{u'Key': u'cluster_post_status', u'Value': u'primary'},
{u'Key': u'cluster_post_size', u'Value': u'5'},
{u'Key': u'cluster_post_local_index', u'Value': u'0'},
{u'Key': u'cluster_post_node_state', u'Value': u'synced'},
{u'Key': u'cluster_post_indexes_count', u'Value': u'2'},
{u'Key': u'cluster_post_indexes', u'Value': u'pq1,pq_posts'},
{u'Key': u'cluster_post_nodes_set', u'Value': u'10.10.0.1:9312'},
{u'Key': u'cluster_post_nodes_view', u'Value': u'10.10.0.1:9312,10.10.0.1:9320:replication,10.10.1.1:9312,10.10.1.1:9320:replication'}],
u'error': u'',
u'total': 0,
u'warning': u''}
res = await utilsApi.sql('SHOW STATUS');
{"columns": [{"Key": {"type": "string"}},
{"Value": {"type": "string"}}],
"data": [
{"Key": "cluster_name", "Value": "post"},
{"Key": "cluster_post_state_uuid", "Value": "fba97c45-36df-11e9-a84e-eb09d14b8ea7"},
{"Key": "cluster_post_conf_id", "Value": "1"},
{"Key": "cluster_post_status", "Value": "primary"},
{"Key": "cluster_post_size", "Value": "5"},
{"Key": "cluster_post_local_index", "Value": "0"},
{"Key": "cluster_post_node_state", "Value": "synced"},
{"Key": "cluster_post_indexes_count", "Value": "2"},
{"Key": "cluster_post_indexes", "Value": "pq1,pq_posts"},
{"Key": "cluster_post_nodes_set", "Value": "10.10.0.1:9312"},
{"Key": "cluster_post_nodes_view", "Value": "10.10.0.1:9312,10.10.0.1:9320:replication,10.10.1.1:9312,10.10.1.1:9320:replication"}],
"error": "",
"total": 0,
"warning": ""}
utilsApi.sql("SHOW STATUS");
{columns=[{ Key : { type=string }},
{ Value : { type=string }}],
data : [
{ Key=cluster_name, Value=post},
{ Key=cluster_post_state_uuid, Value=fba97c45-36df-11e9-a84e-eb09d14b8ea7},
{ Key=cluster_post_conf_id, Value=1},
{ Key=cluster_post_status, Value=primary},
{ Key=cluster_post_size, Value=5},
{ Key=cluster_post_local_index, Value=0},
{ Key=cluster_post_node_state, Value=synced},
{ Key=cluster_post_indexes_count, Value=2},
{ Key=cluster_post_indexes, Value=pq1,pq_posts},
{ Key=cluster_post_nodes_set, Value=10.10.0.1:9312},
{ Key=cluster_post_nodes_view, Value=10.10.0.1:9312,10.10.0.1:9320:replication,10.10.1.1:9312,10.10.1.1:9320:replication}],
error= ,
total=0,
warning= }
utilsApi.sql("SHOW STATUS");
{columns=[{ Key : { type=String }},
{ Value : { type=String }}],
data : [
{ Key=cluster_name, Value=post},
{ Key=cluster_post_state_uuid, Value=fba97c45-36df-11e9-a84e-eb09d14b8ea7},
{ Key=cluster_post_conf_id, Value=1},
{ Key=cluster_post_status, Value=primary},
{ Key=cluster_post_size, Value=5},
{ Key=cluster_post_local_index, Value=0},
{ Key=cluster_post_node_state, Value=synced},
{ Key=cluster_post_indexes_count, Value=2},
{ Key=cluster_post_indexes, Value=pq1,pq_posts},
{ Key=cluster_post_nodes_set, Value=10.10.0.1:9312},
{ Key=cluster_post_nodes_view, Value=10.10.0.1:9312,10.10.0.1:9320:replication,10.10.1.1:9312,10.10.1.1:9320:replication}],
error="" ,
total=0,
warning="" }
In a multi-master replication cluster, a reference point must be established before other nodes can join and form the cluster. This is called cluster bootstrapping and involves starting a single node as the primary component. Restarting a single node or reconnecting after a shutdown can be done normally.
In case of a full cluster shutdown, the server that was stopped last should be started first with the --new-cluster command line option or by running manticore_new_cluster through systemd. To ensure that the server is capable of being the reference point, the grastate.dat file located at the cluster path should be updated with a value of 1 for the safe_to_bootstrap option. Both conditions, --new-cluster and safe_to_bootstrap=1, must be met. If any other node is started without these options set, an error will occur. The --new-cluster-force command line option can be used to override this protection and start the cluster from another server forcibly. Alternatively, you can run manticore_new_cluster --force to use systemd.
In the event of a hard crash or an unclean shutdown of all servers in the cluster, the most advanced node with the largest seqno in the grastate.dat file located at the cluster path must be identified and started with the --new-cluster-force command line key.
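For example, to find the reference node you can inspect grastate.dat at the cluster path on every node; a minimal sketch, assuming typical paths and the usual Galera state-file layout (the seqno value is just illustrative):

cat /var/lib/manticore/grastate.dat
# GALERA saved state
version: 2.1
uuid:    fba97c45-36df-11e9-a84e-eb09d14b8ea7
seqno:   312
safe_to_bootstrap: 1

# Start the node with safe_to_bootstrap: 1 (or, after a hard crash, the one with the largest seqno) first:
searchd --new-cluster
# or via systemd: manticore_new_cluster (forcing another node: manticore_new_cluster --force)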
In the event that the Manticore search daemon stops with no remaining nodes in the cluster to serve requests, recovery is necessary. Due to the multi-master nature of the Galera library used for replication, Manticore replication cluster is a single logical entity that maintains the consistency of its nodes and data, and the status of the entire cluster. This allows for safe writes on multiple nodes simultaneously and ensures the integrity of the cluster.
However, this also poses challenges. Let's examine several scenarios, using a cluster of nodes A, B, and C, to see what needs to be done when some or all nodes become unavailable.
When node A is stopped, the other nodes receive a "normal shutdown" message. The cluster size is reduced, and a quorum re-calculation takes place.
Upon starting node A, it joins the cluster and will not serve any write transactions until it is fully synchronized with the cluster. If the writeset cache on donor nodes B or C (which can be controlled with the Galera cluster's gcache.size) still contains all of the transactions missed at node A, node A will receive a fast incremental state transfer (IST), that is, a transfer of only missed transactions. If not, a snapshot state transfer (SST) will occur, which involves the transfer of table files.
In the scenario where nodes A and B are stopped, the cluster size is reduced to one, with node C forming the primary component to handle write transactions.
Nodes A and B can then be started as usual and will join the cluster after start-up. Node C acts as the donor, providing the state transfer to nodes A and B.
All nodes are stopped as usual and the cluster is off.
The problem now is how to initialize the cluster. On a clean shutdown of searchd, the nodes write the number of the last executed transaction into the grastate.dat file in the cluster directory, along with the safe_to_bootstrap flag. The node that was stopped last will have safe_to_bootstrap: 1 and the most advanced seqno.
It is important that this node starts first to form the cluster. To bootstrap a cluster, the server should be started on this node with the --new-cluster flag. On Linux, you can also run manticore_new_cluster, which will start Manticore in --new-cluster mode via systemd.
If another node starts first and bootstraps the cluster, the most advanced node will then join that cluster, perform a full SST, and receive a table file in which some transactions are missing compared to the table files it had before. This is why it is important to start the node that was shut down last first: it should have the flag safe_to_bootstrap: 1 in grastate.dat.
In the event of a crash or network failure causing Node A to disappear from the cluster, nodes B and C will attempt to reconnect with Node A. Upon failure, they will remove Node A from the cluster. With two out of the three nodes still running, the cluster maintains its quorum and continues to operate normally.
When Node A is restarted, it will join the cluster automatically, as outlined in Case 1.
Nodes A and B have gone offline. Node C is unable to form a quorum on its own as 1 node is less than half of the total nodes (3). As a result, the cluster on node C is shifted to a non-primary state and rejects any write transactions with an error message.
Meanwhile, node C waits for the other nodes to connect and also tries to connect to them. If this happens, and the network is restored and nodes A and B are back online, the cluster will automatically reform. If nodes A and B are just temporarily disconnected from node C but can still communicate with each other, they will continue to operate as normal, as they still form the quorum.
However, if both nodes A and B have crashed or restarted due to a power failure, someone must activate the primary component on node C using the following command:
SET CLUSTER posts GLOBAL 'pc.bootstrap' = 1
POST /cli -d "
SET CLUSTER posts GLOBAL 'pc.bootstrap' = 1
"
It's important to note that before executing this command, you must confirm that the other nodes are truly unreachable. Otherwise, a split-brain scenario may occur and separate clusters may form.
All nodes have crashed. In this situation, the grastate.dat file in the cluster directory has not been updated and does not contain a valid seqno sequence number.
If this occurs, someone needs to locate the node with the most recent data and start the server on it using the --new-cluster-force command line key. All other nodes will start as normal, as described in Case 3.
On Linux, you can also use the manticore_new_cluster --force command, which will start Manticore in --new-cluster-force mode via systemd.
Split-brain can cause the cluster to transition into a non-primary state. For example, consider a cluster comprised of an even number of nodes (four), such as two pairs of nodes located in different data centers. If a network failure interrupts the connection between the data centers, split-brain occurs as each group of nodes holds exactly half of the quorum. As a result, both groups stop handling write transactions, since the Galera replication model prioritizes data consistency, and the cluster cannot accept write transactions without a quorum. However, nodes in both groups attempt to reconnect with the nodes from the other group in an effort to restore the cluster.
If someone wants to restore the cluster before the network is restored, the same steps outlined in Case 5 should be taken, but only at one group of nodes.
After the statement is executed, the group with the node that it was run on will be able to handle write transactions once again.
SET CLUSTER posts GLOBAL 'pc.bootstrap' = 1
POST /cli -d "
SET CLUSTER posts GLOBAL 'pc.bootstrap' = 1
"
However, it's important to note that if the statement is issued at both groups, it will result in the formation of two separate clusters, and the subsequent network recovery will not result in the groups rejoining.
With the default configuration, Manticore waits for your connections on:
mysql -h0 -P9306
curl -s "http://localhost:9308/search"
require_once __DIR__ . '/vendor/autoload.php';
$config = ['host'=>'127.0.0.1','port'=>9308];
$client = new \Manticoresearch\Client($config);
import manticoresearch
config = manticoresearch.Configuration(
host = "http://127.0.0.1:9308"
)
client = manticoresearch.ApiClient(config)
indexApi = manticoresearch.IndexApi(client)
searchApi = manticoresearch.SearchApi(client)
utilsApi = manticoresearch.UtilsApi(client)
var Manticoresearch = require('manticoresearch');
var client= new Manticoresearch.ApiClient()
client.basePath="http://127.0.0.1:9308";
indexApi = new Manticoresearch.IndexApi(client);
searchApi = new Manticoresearch.SearchApi(client);
utilsApi = new Manticoresearch.UtilsApi(client);
import com.manticoresearch.client.ApiClient;
import com.manticoresearch.client.ApiException;
import com.manticoresearch.client.Configuration;
import com.manticoresearch.client.model.*;
import com.manticoresearch.client.api.IndexApi;
import com.manticoresearch.client.api.UtilsApi;
import com.manticoresearch.client.api.SearchApi;
ApiClient client = Configuration.getDefaultApiClient();
client.setBasePath("http://127.0.0.1:9308");
IndexApi indexApi = new IndexApi(client);
SearchApi searchApi = new SearchApi(client);
UtilsApi utilsApi = new UtilsApi(client);
using ManticoreSearch.Client;
using ManticoreSearch.Api;
using ManticoreSearch.Model;
string basePath = "http://127.0.0.1:9308";
IndexApi indexApi = new IndexApi(basePath);
SearchApi searchApi = new SearchApi(basePath);
UtilsApi utilsApi = new UtilsApi(basePath);
docker run -e EXTRA=1 --name manticore -d manticoresearch/manticore && docker exec -it manticore mysql
Manticore Search implements an SQL interface using the MySQL protocol, allowing any MySQL library or connector and many MySQL clients to be used to connect to Manticore Search and work with it as if it were a MySQL server, not Manticore.
However, the SQL dialect is different and implements only a subset of the SQL commands or functions available in MySQL. Additionally, there are clauses and functions that are specific to Manticore Search, such as the MATCH() clause for full-text search.
Manticore Search does not support server-side prepared statements, but client-side prepared statements can be used. It is important to note that Manticore implements the multi-value (MVA) data type, which has no equivalent in MySQL or libraries implementing prepared statements. In these cases, the MVA values must be crafted in the raw query.
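For example, a multi-value attribute has to be written literally into the query text rather than bound as a parameter; a minimal sketch, assuming a table with an MVA column named tags (hypothetical schema):

INSERT INTO products (id, title, tags) VALUES (1, 'Crossbody Bag with Tassel', (101, 302, 304));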
Some MySQL clients/connectors require values for user/password and/or database name. Since Manticore Search does not have the concept of databases and there is no user access control implemented, these values can be set arbitrarily as Manticore will simply ignore them.
The default port for the SQL interface is 9306 and it's enabled by default.
You can configure the MySQL port in the searchd section of the configuration file using the listen directive like this:
searchd {
...
listen = 127.0.0.1:9306:mysql
...
}
Keep in mind that Manticore doesn't have user authentication, so make sure that the MySQL port is not accessible to anyone outside of your network.
A separate MySQL port can be used for performing "VIP" connections. When connecting to this port, the thread pool is bypassed, and a new dedicated thread is always created. This is useful in cases of severe overload, where the server would either stall or prevent a connection through the regular port.
searchd {
...
listen = 127.0.0.1:9306:mysql
listen = 127.0.0.1:9307:mysql_vip
...
}
The easiest way to connect to Manticore is by using a standard MySQL client:
mysql -P9306 -h0
The MySQL protocol supports SSL encryption. Secure connections can be made on the same mysql listening port.
Compression can be used with MySQL connections and is available to clients by default. The client just needs to specify that the connection should use compression.
An example using the MySQL client:
mysql -P9306 -h0 -C
Compression can be used in both secured and non-secured connections.
The official MySQL connectors can be used to connect to Manticore Search, however they might require certain settings passed in the DSN string as the connector can try running certain SQL commands not implemented yet in Manticore.
JDBC Connector 6.x and above require Manticore Search 2.8.2 or greater and the DSN string should contain the following options:
jdbc:mysql://IP:PORT/DB/?characterEncoding=utf8&maxAllowedPacket=512000&serverTimezone=XXX
By default, Manticore Search reports its own version to the connector; however, this may cause some issues. To overcome this, the mysql_version_string directive in the searchd section of the configuration should be set to a version lower than 5.1.1:
searchd {
...
mysql_version_string = 5.0.37
...
}
The .NET MySQL connector uses connection pooling by default. To correctly get SHOW META statistics, queries should be sent together with the SHOW META command as a single multi-statement (SELECT ...; SHOW META). If pooling is enabled, the Allow Batch=True option is required in the connection string to allow multi-statements:
Server=127.0.0.1;Port=9306;Database=somevalue;Uid=somevalue;Pwd=;Allow Batch=True;
Manticore can be accessed using ODBC. It's recommended to set charset=UTF8 in the ODBC string. Some ODBC drivers may not accept the version reported by the Manticore server, as they will see it as a very old MySQL server. This can be overridden with the mysql_version_string option.
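A minimal sketch of an ODBC connection string (the driver name is an assumption and depends on the MySQL ODBC driver installed on your system):

Driver={MySQL ODBC 8.0 Unicode Driver};Server=127.0.0.1;Port=9306;charset=UTF8;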
Manticore SQL over MySQL supports C-style comment syntax. Everything from an opening /* sequence to a closing */ sequence is ignored. Comments can span multiple lines, cannot nest, and are not logged. MySQL-specific /*! ... */ comments are also currently ignored. (Comment support was added mainly for better compatibility with dumps produced by mysqldump, rather than to improve general query interoperability between Manticore and MySQL.)
SELECT /*! SQL_CALC_FOUND_ROWS */ col1 FROM table1 WHERE ...
You can connect to Manticore Search through HTTP/HTTPS.
By default, Manticore listens for HTTP, HTTPS, and binary requests on ports 9308 and 9312.
In the "searchd" section of your configuration file, you can define the HTTP port using the listen directive as follows:
Both lines are valid and have the same meaning (except for the port number). They both define listeners that will serve all API/HTTP/HTTPS protocols. There are no special requirements, and any HTTP client can be used to connect to Manticore.
searchd {
...
listen = 127.0.0.1:9308
listen = 127.0.0.1:9312:http
...
}
All HTTP endpoints return application/json content type. For the most part, endpoints use JSON payloads for requests. However, there are some exceptions that use NDJSON or simple URL-encoded payloads.
Currently, there is no user authentication. Therefore, make sure that the HTTP interface is not accessible to anyone outside your network. As Manticore functions like any other web server, you can use a reverse proxy, such as Nginx, to implement HTTP authentication or caching.
The HTTP protocol also supports SSL encryption:
If you specify :https instead of :http, only secured connections will be accepted. If no valid key/certificate is provided but a client tries to connect via HTTPS, the connection will be dropped. If you make a plain HTTP (not HTTPS) request to port 9443, the server will respond with HTTP code 400.
searchd {
...
listen = 127.0.0.1:9308
listen = 127.0.0.1:9443:https
...
}
A separate HTTP interface can be used for 'VIP' connections. In this case, the connection bypasses the thread pool and always creates a new dedicated thread. This is useful for managing Manticore Search during periods of severe overload, when the server might otherwise stall or refuse connections on the regular port.
For more information on the listen directive, see this section.
searchd {
...
listen = 127.0.0.1:9308
listen = 127.0.0.1:9318:_vip
...
}
Endpoints /sql and /cli allow running SQL queries via HTTP.
/sql endpoint accepts only SELECT statements and returns the response in HTTP JSON format. The query parameter should be URL-encoded.
/sql?mode=raw endpoint accepts any SQL query and returns the response in raw format, similar to what you would receive via mysql. The query parameter should also be URL-encoded.
/cli endpoint accepts any SQL query and returns the response in raw format, similar to what you would receive via mysql. Unlike the /sql and /sql?mode=raw endpoints, the query parameter should not be URL-encoded. This endpoint is intended for manual actions using a browser or command line HTTP clients such as curl. It is not recommended to use the /cli endpoint in scripts.
/sql accepts an SQL SELECT query via the HTTP JSON interface.
The query payload must be URL-encoded; otherwise, query statements containing = (for filtering or setting options) will result in an error.
It returns a JSON response containing hits information and the execution time. The response has the same format as the json/search endpoint. Note that the /sql endpoint supports only single search requests. If you need to process a multi-query, see below.
POST /sql -d "query=select%20id%2Csubject%2Cauthor_id%20%20from%20forum%20where%20match%28%27%40subject%20php%20manticore%27%29%20group%20by%20author_id%20order%20by%20id%20desc%20limit%200%2C5"
{
"took": 0,
"timed_out": false,
"hits": {
"total": 2,
"total_relation": "eq",
"hits": [
{
"_id": "2",
"_score": 2356,
"_source": {
"subject": "php manticore",
"author_id": 12
}
},
{
"_id": "1",
"_score": 2356,
"_source": {
"subject": "php manticore",
"author_id": 11
}
}
]
}
}
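If you use curl, you can let it handle the URL encoding of the query parameter instead of encoding it by hand; a minimal sketch against the same /sql endpoint (host, port, and table name are taken from the examples in this document; --data-urlencode is a standard curl flag):

curl -s "http://localhost:9308/sql" --data-urlencode "query=select id,subject,author_id from forum where match('@subject php manticore') group by author_id order by id desc limit 0,5"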
The /sql endpoint also has a special mode, "raw", which allows sending any valid SphinxQL queries, including multi-queries. The returned value is a JSON array of one or more result sets.
POST /sql?mode=raw -d "query=desc%20test"
[
{
"columns": [
{
"Field": {
"type": "string"
}
},
{
"Type": {
"type": "string"
}
},
{
"Properties": {
"type": "string"
}
}
],
"data": [
{
"Field": "id",
"Type": "bigint",
"Properties": ""
},
{
"Field": "title",
"Type": "text",
"Properties": "indexed"
},
{
"Field": "gid",
"Type": "uint",
"Properties": ""
},
{
"Field": "title",
"Type": "string",
"Properties": ""
},
{
"Field": "j",
"Type": "json",
"Properties": ""
},
{
"Field": "new1",
"Type": "uint",
"Properties": ""
}
],
"total": 6,
"error": "",
"warning": ""
}
]
While the /sql endpoint is useful for controlling Manticore programmatically from your application, there is also the /cli endpoint, which makes it easier to maintain a Manticore instance manually via curl or your browser. It accepts POST and GET HTTP methods. Everything after /cli? is taken by Manticore as is, even if you don't escape it manually via curl or let the browser encode it automatically. The + sign is also not decoded to a space, so there is no need to encode it. The response format is tabular, similar to the one returned by the MySQL console.
POST /cli -d "desc test"
+-------+--------+----------------+
| Field | Type | Properties |
+-------+--------+----------------+
| id | bigint | |
| body | text | indexed stored |
| title | string | |
+-------+--------+----------------+
3 rows in set (0.001 sec)
The /cli_json endpoint provides the same functionality as /cli, but returns the response in JSON format.
POST /cli_json -d "desc test"
[{
"columns":[{"Field":{"type":"string"}},{"Type":{"type":"string"}},{"Properties":{"type":"string"}}],
"data":[
{"Field":"id","Type":"bigint","Properties":""},
{"Field":"body","Type":"text","Properties":"indexed stored"},
{"Field":"title","Type":"string","Properties":""}
],
"total":3,
"error":"",
"warning":""
}]
HTTP keep-alive is also supported, which makes working via the HTTP JSON interface stateful as long as the client supports keep-alive too. For example, using the new /cli endpoint you can call SHOW META after SELECT and it will work the same way it works via mysql.
You can add, update, replace, and delete your indexed data in several ways provided by Manticore. Manticore supports working with external storages such as databases, XML, CSV, and TSV documents. For insert and delete operations, a transaction mechanism is supported.
Also, for insert and replace queries, Manticore supports Elasticsearch-like query format along with its own format. For details, see the corresponding examples in the Adding documents to a real-time table and REPLACE sections.
If you're looking for information on adding documents to a plain table, please refer to the section on adding data from external storages.
Adding documents in real-time is supported only for Real-Time and percolate tables. The corresponding SQL command, HTTP endpoint, or client functions insert new rows (documents) into a table with the provided field values. It's not necessary for a table to exist before adding documents to it. If the table doesn't exist, Manticore will attempt to create it automatically. For more information, see Auto schema.
You can insert a single or multiple documents with values for all fields of the table or just a portion of them. In this case, the other fields will be filled with their default values (0 for scalar types, an empty string for text types).
Expressions are not currently supported in INSERT, so values must be explicitly specified.
The ID field/value can be omitted, as RT and PQ tables support auto-id functionality. You can also use 0 as the id value to force automatic ID generation. Rows with duplicate IDs will not be overwritten by INSERT. Instead, you can use REPLACE for that purpose.
When using the HTTP JSON protocol, you have two different request formats to choose from: a common Manticore format and an Elasticsearch-like format. Both formats are demonstrated in the examples below.
Additionally, when using the Manticore JSON request format, keep in mind that the doc node is required, and all the values should be provided within it.
INSERT INTO <table name> [(column, ...)]
VALUES (value, ...)
[, (...)]
INSERT INTO products(title,price) VALUES ('Crossbody Bag with Tassel', 19.85);
INSERT INTO products(title) VALUES ('Crossbody Bag with Tassel');
INSERT INTO products VALUES (0,'Yellow bag', 4.95);
Query OK, 1 rows affected (0.00 sec)
Query OK, 1 rows affected (0.00 sec)
Query OK, 1 rows affected (0.00 sec)
POST /insert
{
"index":"products",
"id":1,
"doc":
{
"title" : "Crossbody Bag with Tassel",
"price" : 19.85
}
}
POST /insert
{
"index":"products",
"id":2,
"doc":
{
"title" : "Crossbody Bag with Tassel"
}
}
POST /insert
{
"index":"products",
"id":0,
"doc":
{
"title" : "Yellow bag"
}
}
{
"_index": "products",
"_id": 1,
"created": true,
"result": "created",
"status": 201
}
{
"_index": "products",
"_id": 2,
"created": true,
"result": "created",
"status": 201
}
{
"_index": "products",
"_id": 0,
"created": true,
"result": "created",
"status": 201
}
POST /products/_create/3
{
"title": "Yellow Bag with Tassel",
"price": 19.85
}
POST /products/_create/
{
"title": "Red Bag with Tassel",
"price": 19.85
}
{
"_id":3,
"_index":"products",
"_primary_term":1,
"_seq_no":0,
"_shards":{
"failed":0,
"successful":1,
"total":1
},
"_type":"_doc",
"_version":1,
"result":"updated"
}
{
"_id":2235747273424240642,
"_index":"products",
"_primary_term":1,
"_seq_no":0,
"_shards":{
"failed":0,
"successful":1,
"total":1
},
"_type":"_doc",
"_version":1,
"result":"updated"
}
$index->addDocuments([
['id' => 1, 'title' => 'Crossbody Bag with Tassel', 'price' => 19.85]
]);
$index->addDocuments([
['id' => 2, 'title' => 'Crossbody Bag with Tassel']
]);
$index->addDocuments([
['id' => 0, 'title' => 'Yellow bag']
]);
indexApi.insert({"index" : "test", "id" : 1, "doc" : {"title" : "Crossbody Bag with Tassel", "price" : 19.85}})
indexApi.insert({"index" : "test", "id" : 2, "doc" : {"title" : "Crossbody Bag with Tassel"}})
indexApi.insert({"index" : "test", "id" : 0, "doc" : {{"title" : "Yellow bag"}})
res = await indexApi.insert({"index" : "test", "id" : 1, "doc" : {"title" : "Crossbody Bag with Tassel", "price" : 19.85}});
res = await indexApi.insert({"index" : "test", "id" : 2, "doc" : {"title" : "Crossbody Bag with Tassel"}});
res = await indexApi.insert({"index" : "test", "id" : 0, "doc" : {"title" : "Yellow bag"}});
InsertDocumentRequest newdoc = new InsertDocumentRequest();
HashMap<String,Object> doc = new HashMap<String,Object>(){{
put("title","Crossbody Bag with Tassel");
put("price",19.85);
}};
newdoc.index("products").id(1L).setDoc(doc);
sqlresult = indexApi.insert(newdoc);
newdoc = new InsertDocumentRequest();
HashMap<String,Object> doc = new HashMap<String,Object>(){{
put("title","Crossbody Bag with Tassel");
}};
newdoc.index("products").id(2L).setDoc(doc);
sqlresult = indexApi.insert(newdoc);
newdoc = new InsertDocumentRequest();
HashMap<String,Object> doc = new HashMap<String,Object>(){{
put("title","Yellow bag");
}};
newdoc.index("products").id(0L).setDoc(doc);
sqlresult = indexApi.insert(newdoc);
Dictionary<string, Object> doc = new Dictionary<string, Object>();
doc.Add("title", "Crossbody Bag with Tassel");
doc.Add("price", 19.85);
InsertDocumentRequest newdoc = new InsertDocumentRequest(index: "products", id: 1, doc: doc);
var sqlresult = indexApi.Insert(newdoc);
doc = new Dictionary<string, Object>();
doc.Add("title", "Crossbody Bag with Tassel");
newdoc = new InsertDocumentRequest(index: "products", id: 2, doc: doc);
sqlresult = indexApi.Insert(newdoc);
doc = new Dictionary<string, Object>();
doc.Add("title", "Yellow bag");
newdoc = new InsertDocumentRequest(index: "products", id: 0, doc: doc);
sqlresult = indexApi.Insert(newdoc);
Manticore features an automatic table creation mechanism, which activates when a specified table in the insert query doesn't yet exist. This mechanism is enabled by default. To disable it, set auto_schema = 0 in the Searchd section of your Manticore config file.
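For example, a minimal config sketch that turns the mechanism off (the directive goes into the searchd section, as noted above):

searchd {
    ...
    auto_schema = 0
}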
By default, all text values in the VALUES clause are considered to be of the text type, except for values representing valid email addresses, which are treated as the string type.
If you attempt to INSERT multiple rows with different, incompatible value types for the same field, auto table creation will be canceled, and an error message will be returned. However, if the different value types are compatible, the resulting field type will be the one that accommodates all the values. Some automatic data type conversions that may occur include:
Keep in mind that the /bulk HTTP endpoint does not support automatic table creation (auto schema). Only the /_bulk (Elasticsearch-like) HTTP endpoint and the SQL interface support this feature.
MySQL [(none)]> drop table if exists t; insert into t(i,f,t,s,j,b,m,mb) values(123,1.2,'text here','test@mail.com','{"a": 123}',1099511627776,(1,2),(1099511627776,1099511627777)); desc t; select * from t;
--------------
drop table if exists t
--------------
Query OK, 0 rows affected (0.42 sec)
--------------
insert into t(i,f,t,s,j,b,m,mb) values(123,1.2,'text here','test@mail.com','{"a": 123}',1099511627776,(1,2),(1099511627776,1099511627777))
--------------
Query OK, 1 row affected (0.00 sec)
--------------
desc t
--------------
+-------+--------+----------------+
| Field | Type | Properties |
+-------+--------+----------------+
| id | bigint | |
| t | text | indexed stored |
| s | string | |
| j | json | |
| i | uint | |
| b | bigint | |
| f | float | |
| m | mva | |
| mb | mva64 | |
+-------+--------+----------------+
9 rows in set (0.00 sec)
--------------
select * from t
--------------
+---------------------+------+---------------+----------+------+-----------------------------+-----------+---------------+------------+
| id | i | b | f | m | mb | t | s | j |
+---------------------+------+---------------+----------+------+-----------------------------+-----------+---------------+------------+
| 5045949922868723723 | 123 | 1099511627776 | 1.200000 | 1,2 | 1099511627776,1099511627777 | text here | test@mail.com | {"a": 123} |
+---------------------+------+---------------+----------+------+-----------------------------+-----------+---------------+------------+
1 row in set (0.00 sec)
POST /insert -d
{
"index":"t",
"id": 2,
"doc":
{
"i" : 123,
"f" : 1.23,
"t": "text here",
"s": "test@mail.com",
"j": {"a": 123},
"b": 1099511627776,
"m": [1,2],
"mb": [1099511627776,1099511627777]
}
}
{"_index":"t","_id":2,"created":true,"result":"created","status":201}
Manticore provides an auto ID generation functionality for the column ID of documents inserted or replaced into a real-time or Percolate table. The generator produces a unique ID for a document with some guarantees, but it should not be considered an auto-incremented ID.
The generated ID value is guaranteed to be unique under the following conditions:
The auto ID generator creates a 64-bit integer for a document ID and uses the following schema:
This schema ensures that the generated ID is unique among all nodes in the cluster and that data inserted into different cluster nodes does not create collisions between the nodes.
As a result, the first ID from the generator used for auto ID is NOT 1 but a larger number. Additionally, the document stream inserted into a table might have non-sequential ID values if inserts into other tables occur between calls, because there is a single ID generator per server, shared among all its tables.
INSERT INTO products(title,price) VALUES ('Crossbody Bag with Tassel', 19.85);
INSERT INTO products VALUES (0,'Yello bag', 4.95);
select * from products;
+---------------------+-----------+---------------------------+
| id | price | title |
+---------------------+-----------+---------------------------+
| 1657860156022587404 | 19.850000 | Crossbody Bag with Tassel |
| 1657860156022587405 | 4.950000 | Yello bag |
+---------------------+-----------+---------------------------+
POST /insert
{
"index":"products",
"id":0,
"doc":
{
"title" : "Yellow bag"
}
}
GET /search
{
"index":"products",
"query":{
"query_string":""
}
}
{
"took": 0,
"timed_out": false,
"hits": {
"total": 1,
"hits": [
{
"_id": "1657860156022587406",
"_score": 1,
"_source": {
"price": 0,
"title": "Yellow bag"
}
}
]
}
}
$index->addDocuments([
['id' => 0, 'title' => 'Yellow bag']
]);
indexApi.insert({"index" : "products", "id" : 0, "doc" : {"title" : "Yellow bag"}})
res = await indexApi.insert({"index" : "products", "id" : 0, "doc" : {"title" : "Yellow bag"}});
newdoc = new InsertDocumentRequest();
HashMap<String,Object> doc = new HashMap<String,Object>(){{
put("title","Yellow bag");
}};
newdoc.index("products").id(0L).setDoc(doc);
sqlresult = indexApi.insert(newdoc);
Dictionary<string, Object> doc = new Dictionary<string, Object>();
doc.Add("title", "Yellow bag");
InsertDocumentRequest newdoc = new InsertDocumentRequest(index: "products", id: 0, doc: doc);
var sqlresult = indexApi.Insert(newdoc);
You can insert not just a single document into a real-time table, but as many as you'd like. It's perfectly fine to insert batches of tens of thousands of documents into a real-time table. However, it's important to keep the following points in mind:
Note that the /bulk HTTP endpoint does not support automatic creation of tables (auto schema). Only the /_bulk (Elasticsearch-like) HTTP endpoint and the SQL interface support this feature.
The /bulk (Manticore mode) endpoint supports Chunked transfer encoding. You can use it to transmit large batches. It:
- allows transmitting batches larger than max_packet_size (128MB), for example, 1GB at a time.

The SQL syntax for inserting multiple documents is:

INSERT INTO <table name>[(column1, column2, ...)] VALUES (value1[, value2, ...])[,(...)]
INSERT INTO products(title,price) VALUES ('Crossbody Bag with Tassel', 19.85), ('microfiber sheet set', 19.99), ('Pet Hair Remover Glove', 7.99);
Query OK, 3 rows affected (0.01 sec)
POST /bulk
-H "Content-Type: application/x-ndjson" -d '
{"insert": {"index":"products", "id":1, "doc": {"title":"Crossbody Bag with Tassel","price" : 19.85}}}
{"insert":{"index":"products", "id":2, "doc": {"title":"microfiber sheet set","price" : 19.99}}}
'
POST /bulk
-H "Content-Type: application/x-ndjson" -d '
{"insert":{"index":"test1","id":21,"doc":{"int_col":1,"price":1.1,"title":"bulk doc one"}}}
{"insert":{"index":"test1","id":22,"doc":{"int_col":2,"price":2.2,"title":"bulk doc two"}}}
{"insert":{"index":"test1","id":23,"doc":{"int_col":3,"price":3.3,"title":"bulk doc three"}}}
{"insert":{"index":"test2","id":24,"doc":{"int_col":4,"price":4.4,"title":"bulk doc four"}}}
{"insert":{"index":"test2","id":25,"doc":{"int_col":5,"price":5.5,"title":"bulk doc five"}}}
'
{
"items": [
{
"bulk": {
"_index": "products",
"_id": 2,
"created": 2,
"deleted": 0,
"updated": 0,
"result": "created",
"status": 201
}
}
],
"current_line": 4,
"skipped_lines": 0,
"errors": false,
"error": ""
}
{
"items": [
{
"bulk": {
"_index": "test1",
"_id": 22,
"created": 2,
"deleted": 0,
"updated": 0,
"result": "created",
"status": 201
}
},
{
"bulk": {
"_index": "test1",
"_id": 23,
"created": 1,
"deleted": 0,
"updated": 0,
"result": "created",
"status": 201
}
},
{
"bulk": {
"_index": "test2",
"_id": 25,
"created": 2,
"deleted": 0,
"updated": 0,
"result": "created",
"status": 201
}
}
],
"current_line": 8,
"skipped_lines": 0,
"errors": false,
"error": ""
}
POST /_bulk
-H "Content-Type: application/x-ndjson" -d '
{ "index" : { "_index" : "products" } }
{ "title" : "Yellow Bag", "price": 12 }
{ "create" : { "_index" : "products" } }
{ "title" : "Red Bag", "price": 12.5, "id": 3 }
'
{
"items": [
{
"index": {
"_index": "products",
"_type": "doc",
"_id": "0",
"_version": 1,
"result": "created",
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"_seq_no": 0,
"_primary_term": 1,
"status": 201
}
},
{
"create": {
"_index": "products",
"_type": "doc",
"_id": "3",
"_version": 1,
"result": "created",
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"_seq_no": 0,
"_primary_term": 1,
"status": 201
}
}
],
"errors": false,
"took": 1
}
$index->addDocuments([
['id' => 1, 'title' => 'Crossbody Bag with Tassel', 'price' => 19.85],
['id' => 2, 'title' => 'microfiber sheet set', 'price' => 19.99],
['id' => 3, 'title' => 'Pet Hair Remover Glove', 'price' => 7.99]
]);
docs = [ \
{"insert": {"index" : "products", "id" : 1, "doc" : {"title" : "Crossbody Bag with Tassel", "price" : 19.85}}}, \
{"insert": {"index" : "products", "id" : 2, "doc" : {"title" : "microfiber sheet set", "price" : 19.99}}}, \
{"insert": {"index" : "products", "id" : 3, "doc" : {"title" : "CPet Hair Remover Glove", "price" : 7.99}}}
]
res = indexApi.bulk('\n'.join(map(json.dumps,docs)))
let docs = [
{"insert": {"index" : "products", "id" : 3, "doc" : {"title" : "Crossbody Bag with Tassel", "price" : 19.85}}},
{"insert": {"index" : "products", "id" : 4, "doc" : {"title" : "microfiber sheet set", "price" : 19.99}}},
{"insert": {"index" : "products", "id" : 5, "doc" : {"title" : "CPet Hair Remover Glove", "price" : 7.99}}}
];
res = await indexApi.bulk(docs.map(e=>JSON.stringify(e)).join('\n'));
String body = "{\"insert\": {\"index\" : \"products\", \"id\" : 1, \"doc\" : {\"title\" : \"Crossbody Bag with Tassel\", \"price\" : 19.85}}}"+"\n"+
"{\"insert\": {\"index\" : \"products\", \"id\" : 4, \"doc\" : {\"title\" : \"microfiber sheet set\", \"price\" : 19.99}}}"+"\n"+
"{\"insert\": {\"index\" : \"products\", \"id\" : 5, \"doc\" : {\"title\" : \"CPet Hair Remover Glove\", \"price\" : 7.99}}}"+"\n";
BulkResponse bulkresult = indexApi.bulk(body);
string body = "{\"insert\": {\"index\" : \"products\", \"id\" : 1, \"doc\" : {\"title\" : \"Crossbody Bag with Tassel\", \"price\" : 19.85}}}"+"\n"+
"{\"insert\": {\"index\" : \"products\", \"id\" : 4, \"doc\" : {\"title\" : \"microfiber sheet set\", \"price\" : 19.99}}}"+"\n"+
"{\"insert\": {\"index\" : \"products\", \"id\" : 5, \"doc\" : {\"title\" : \"CPet Hair Remover Glove\", \"price\" : 7.99}}}"+"\n";
BulkResponse bulkresult = indexApi.Bulk(body);
Multi-value attributes (MVA) are inserted as arrays of numbers.
INSERT INTO products(title, sizes) VALUES('shoes', (40,41,42,43));
POST /insert
{
"index":"products",
"id":1,
"doc":
{
"title" : "shoes",
"sizes" : [40, 41, 42, 43]
}
}
POST /products/_create/1
{
"title": "shoes",
"sizes" : [40, 41, 42, 43]
}
POST /products/_doc/
{
"title": "shoes",
"sizes" : [40, 41, 42, 43]
}
$index->addDocument(
['title' => 'shoes', 'sizes' => [40,41,42,43]],
1
);
indexApi.insert({"index" : "products", "id" : 0, "doc" : {"title" : "Yellow bag","sizes":[40,41,42,43]}})
res = await indexApi.insert({"index" : "products", "id" : 0, "doc" : {"title" : "Yellow bag","sizes":[40,41,42,43]}});
newdoc = new InsertDocumentRequest();
HashMap<String,Object> doc = new HashMap<String,Object>(){{
put("title","Yellow bag");
put("sizes",new int[]{40,41,42,43});
}};
newdoc.index("products").id(0L).setDoc(doc);
sqlresult = indexApi.insert(newdoc);
Dictionary<string, Object> doc = new Dictionary<string, Object>();
doc.Add("title", "Yellow bag");
doc.Add("sizes", new List<Object> {40,41,42,43});
InsertDocumentRequest newdoc = new InsertDocumentRequest(index: "products", id: 0, doc: doc);
var sqlresult = indexApi.Insert(newdoc);
JSON value can be inserted as an escaped string (via SQL or JSON) or as a JSON object (via the JSON interface).
INSERT INTO products VALUES (1, 'shoes', '{"size": 41, "color": "red"}');
POST /insert
{
"index":"products",
"id":1,
"doc":
{
"title" : "shoes",
"meta" : {
"size": 41,
"color": "red"
}
}
}
POST /insert
{
"index":"products",
"id":1,
"doc":
{
"title" : "shoes",
"meta" : "{\"size\": 41, \"color\": \"red\"}"
}
}
POST /products/_create/1
{
"title": "shoes",
"meta" : {
"size": 41,
"color": "red"
}
}
POST /products/_doc/
{
"title": "shoes",
"meta" : {
"size": 41,
"color": "red"
}
}
$index->addDocument(
['title' => 'shoes', 'meta' => '{"size": 41, "color": "red"}'],
1
);
indexApi = api = manticoresearch.IndexApi(client)
indexApi.insert({"index" : "products", "id" : 0, "doc" : {"title" : "Yellow bag","meta":'{"size": 41, "color": "red"}'}})
res = await indexApi.insert({"index" : "products", "id" : 0, "doc" : {"title" : "Yellow bag","meta":'{"size": 41, "color": "red"}'}});
newdoc = new InsertDocumentRequest();
HashMap<String,Object> doc = new HashMap<String,Object>(){{
put("title","Yellow bag");
put("meta",
new HashMap<String,Object>(){{
put("size",41);
put("color","red");
}});
}};
newdoc.index("products").id(0L).setDoc(doc);
sqlresult = indexApi.insert(newdoc);
Dictionary<string, Object> meta = new Dictionary<string, Object>();
meta.Add("size", 41);
meta.Add("color", "red");
Dictionary<string, Object> doc = new Dictionary<string, Object>();
doc.Add("title", "Yellow bag");
doc.Add("meta", meta);
InsertDocumentRequest newdoc = new InsertDocumentRequest(index: "products", id: 0, doc: doc);
var sqlresult = indexApi.Insert(newdoc);
In a percolate table, the stored documents are percolate query rules, and they must follow the exact schema of four fields:
| field | type | description |
|---|---|---|
| id | bigint | PQ rule identifier (if omitted, it will be assigned automatically) |
| query | string | Full-text query (can be empty) compatible with the percolate table |
| filters | string | Additional filters by non-full-text fields (can be empty) compatible with the percolate table |
| tags | string | A string with one or many comma-separated tags, which may be used to selectively show/delete saved queries |
Any other field names are not supported and will trigger an error.
Warning: Inserting/replacing JSON-formatted PQ rules via SQL will not work. In other words, the JSON-specific operators (match, etc.) will be considered just parts of the rule's text that should match with documents. If you prefer JSON syntax, use the HTTP endpoint instead of INSERT/REPLACE.
INSERT INTO pq(id, query, filters) VALUES (1, '@title shoes', 'price > 5');
INSERT INTO pq(id, query, tags) VALUES (2, '@title bag', 'Louis Vuitton');
SELECT * FROM pq;
+------+--------------+---------------+---------+
| id | query | tags | filters |
+------+--------------+---------------+---------+
| 1 | @title shoes | | price>5 |
| 2 | @title bag | Louis Vuitton | |
+------+--------------+---------------+---------+
PUT /pq/pq_table/doc/1
{
"query": {
"match": {
"title": "shoes"
},
"range": {
"price": {
"gt": 5
}
}
},
"tags": ["Loius Vuitton"]
}
PUT /pq/pq_table/doc/2
{
"query": {
"ql": "@title shoes"
},
"filters": "price > 5",
"tags": ["Loius Vuitton"]
}
$newstoredquery = [
'index' => 'test_pq',
'body' => [
'query' => [
'match' => [
'title' => 'shoes'
]
],
'range' => [
'price' => [
'gt' => 5
]
]
],
'tags' => ['Louis Vuitton']
];
$client->pq()->doc($newstoredquery);
newstoredquery ={"index" : "test_pq", "id" : 2, "doc" : {"query": {"ql": "@title shoes"},"filters": "price > 5","tags": ["Loius Vuitton"]}}
indexApi.insert(newstoredquery)
newstoredquery ={"index" : "test_pq", "id" : 2, "doc" : {"query": {"ql": "@title shoes"},"filters": "price > 5","tags": ["Loius Vuitton"]}};
indexApi.insert(newstoredquery);
newstoredquery = new HashMap<String,Object>(){{
    put("query", new HashMap<String,Object>(){{
        put("ql", "@title shoes");
    }});
    put("filters", "price > 5");
    put("tags", new String[] {"Louis Vuitton"});
}};
newdoc.index("test_pq").id(2L).setDoc(newstoredquery);
indexApi.insert(newdoc);
Dictionary<string, Object> query = new Dictionary<string, Object>();
query.Add("ql", "@title shoes");
Dictionary<string, Object> newstoredquery = new Dictionary<string, Object>();
newstoredquery.Add("query", query);
newstoredquery.Add("filters", "price > 5");
newstoredquery.Add("tags", new List<string> {"Louis Vuitton"});
InsertDocumentRequest newdoc = new InsertDocumentRequest(index: "test_pq", id: 2, doc: newstoredquery);
indexApi.Insert(newdoc);
If you don't specify an ID, it will be assigned automatically. You can read more about auto-ID here.
INSERT INTO pq(query, filters) VALUES ('wristband', 'price > 5');
SELECT * FROM pq;
+---------------------+-----------+------+---------+
| id | query | tags | filters |
+---------------------+-----------+------+---------+
| 1657843905795719192 | wristband | | price>5 |
+---------------------+-----------+------+---------+
PUT /pq/pq_table/doc
{
"query": {
"match": {
"title": "shoes"
},
"range": {
"price": {
"gt": 5
}
}
},
"tags": ["Loius Vuitton"]
}
PUT /pq/pq_table/doc
{
"query": {
"ql": "@title shoes"
},
"filters": "price > 5",
"tags": ["Loius Vuitton"]
}
{
"index": "pq_table",
"type": "doc",
"_id": "1657843905795719196",
"result": "created"
}
{
"index": "pq_table",
"type": "doc",
"_id": "1657843905795719198",
"result": "created"
}
$newstoredquery = [
'index' => 'pq_table',
'body' => [
'query' => [
'match' => [
'title' => 'shoes'
]
],
'range' => [
'price' => [
'gt' => 5
]
]
],
'tags' => ['Louis Vuitton']
];
$client->pq()->doc($newstoredquery);
Array(
[index] => pq_table
[type] => doc
[_id] => 1657843905795719198
[result] => created
)
indexApi = api = manticoresearch.IndexApi(client)
newstoredquery ={"index" : "test_pq", "doc" : {"query": {"ql": "@title shoes"},"filters": "price > 5","tags": ["Loius Vuitton"]}}
indexApi.insert(store_query)
{'created': True,
'found': None,
'id': 1657843905795719198,
'index': 'test_pq',
'result': 'created'}
newstoredquery ={"index" : "test_pq", "doc" : {"query": {"ql": "@title shoes"},"filters": "price > 5","tags": ["Loius Vuitton"]}};
res = await indexApi.insert(store_query);
{"_index":"test_pq","_id":1657843905795719198,"created":true,"result":"created"}
newstoredquery = new HashMap<String,Object>(){{
    put("query", new HashMap<String,Object>(){{
        put("ql", "@title shoes");
    }});
    put("filters", "price > 5");
    put("tags", new String[] {"Louis Vuitton"});
}};
newdoc.index("test_pq").setDoc(newstoredquery);
indexApi.insert(newdoc);
Dictionary<string, Object> query = new Dictionary<string, Object>();
query.Add("ql", "@title shoes");
Dictionary<string, Object> newstoredquery = new Dictionary<string, Object>();
newstoredquery.Add("query", query);
newstoredquery.Add("filters", "price > 5");
newstoredquery.Add("tags", new List<string> {"Louis Vuitton"});
InsertDocumentRequest newdoc = new InsertDocumentRequest(index: "test_pq", doc: newstoredquery);
indexApi.Insert(newdoc);
If the schema is omitted in the SQL INSERT command, the following parameters are expected:
1. ID. You can use 0 as the ID to trigger auto-ID generation.
2. Query - Full-text query.
3. Tags - PQ rule tags string.
4. Filters - Additional filters by attributes.
INSERT INTO pq VALUES (0, '@title shoes', '', '');
INSERT INTO pq VALUES (0, '@title shoes', 'Louis Vuitton', '');
SELECT * FROM pq;
+---------------------+--------------+---------------+---------+
| id | query | tags | filters |
+---------------------+--------------+---------------+---------+
| 2810855531667783688 | @title shoes | | |
| 2810855531667783689 | @title shoes | Louis Vuitton | |
+---------------------+--------------+---------------+---------+
To replace an existing PQ rule with a new one in SQL, just use a regular REPLACE command. There's a special syntax ?refresh=1 to replace a PQ rule defined in JSON mode via the HTTP JSON interface.
mysql> select * from pq;
+---------------------+--------------+------+---------+
| id | query | tags | filters |
+---------------------+--------------+------+---------+
| 2810823411335430148 | @title shoes | | |
+---------------------+--------------+------+---------+
1 row in set (0.00 sec)
mysql> replace into pq(id,query) values(2810823411335430148,'@title boots');
Query OK, 1 row affected (0.00 sec)
mysql> select * from pq;
+---------------------+--------------+------+---------+
| id | query | tags | filters |
+---------------------+--------------+------+---------+
| 2810823411335430148 | @title boots | | |
+---------------------+--------------+------+---------+
1 row in set (0.00 sec)
GET /pq/pq/doc/2810823411335430149
{
"took": 0,
"timed_out": false,
"hits": {
"total": 1,
"hits": [
{
"_id": "2810823411335430149",
"_score": 1,
"_source": {
"query": {
"match": {
"title": "shoes"
}
},
"tags": "",
"filters": ""
}
}
]
}
}
PUT /pq/pq/doc/2810823411335430149?refresh=1 -d '{
"query": {
"match": {
"title": "boots"
}
}
}'
GET /pq/pq/doc/2810823411335430149
{
"took": 0,
"timed_out": false,
"hits": {
"total": 1,
"hits": [
{
"_id": "2810823411335430149",
"_score": 1,
"_source": {
"query": {
"match": {
"title": "boots"
}
},
"tags": "",
"filters": ""
}
}
]
}
}
Plain tables are tables that are created one-time by fetching data at creation from one or several sources. A plain table is immutable as documents cannot be added or deleted during its lifespan. It is only possible to update values of numeric attributes (including MVA). Refreshing the data is only possible by recreating the whole table.
Plain tables are available only in the Plain mode and their definition is made up of a table declaration and one or several source declarations. The data gathering and table creation are not made by the searchd server but by the auxiliary tool indexer.
Indexer is a command-line tool that can be called directly from the command line or from shell scripts.
It can accept a number of arguments when called, but there are also several settings of its own in the Manticore configuration file.
In the typical scenario, indexer does the following:
The indexer tool is used to create plain tables in Manticore Search. It has a general syntax of:
indexer [OPTIONS] [table_name1 [table_name2 [...]]]
When creating tables with indexer, the generated table files must be made with permissions that allow searchd to read, write, and delete them. In case of the official Linux packages, searchd runs under the manticore user. Therefore, indexer must also run under the manticore user:
sudo -u manticore indexer ...
If you are running searchd differently, you might need to omit sudo -u manticore. Just make sure that the user under which your searchd instance is running has read/write permissions to the tables generated using indexer.
To create a plain table, you need to list the table(s) you want to process. For example, if your manticore.conf file contains details on two tables, mybigindex and mysmallindex, you could run:
sudo -u manticore indexer mysmallindex mybigindex
You can also use wildcard tokens to match table names:
- ? matches any single character
- * matches any count of any characters
- % matches none or any single character

sudo -u manticore indexer indexpart*main --rotate
The exit codes for indexer are as follows:
- 0: everything went OK
- 1: there was a problem while indexing (and if --rotate was specified, it was skipped) or an operation emitted a warning
- 2: indexing went OK, but the --rotate attempt failed

Also, you can run indexer via a systemctl unit file:
systemctl start --no-block manticore-indexer
Or, in case you want to build a specific table:
systemctl start --no-block manticore-indexer@specific-table-name
Find more information about scheduling indexer via systemd below.
--config <file> (-c <file> for short) tells indexer to use the given file as its configuration. Normally, it will look for manticore.conf in the installation directory (e.g. /etc/manticoresearch/manticore.conf), followed by the current directory you are in when calling indexer from the shell. This is most useful in shared environments where the binary files are installed in a global folder, e.g. /usr/bin/, but you want to provide users with the ability to make their own custom Manticore set-ups, or if you want to run multiple instances on a single server. In cases like those, you could allow them to create their own manticore.conf files and pass them to indexer with this option. For example:
sudo -u manticore indexer --config /home/myuser/manticore.conf mytable
--all tells indexer to update every table listed in manticore.conf instead of listing individual tables. This would be useful in small configurations or for cron-type or maintenance jobs where the entire table set gets rebuilt each day, week, or whatever period is best. Please note that since --all tries to update all found tables in the configuration, it will issue a warning if it encounters RealTime tables, and the exit code of the command will be 1, not 0, even if the plain tables finished without issue. Example usage:
sudo -u manticore indexer --config /home/myuser/manticore.conf --all
--rotate is used for rotating tables. Unless you are in a situation where you can take the search function offline without troubling users, you will almost certainly need to keep search running whilst indexing new documents. --rotate creates a second table, parallel to the first (in the same place, simply including .new in the filenames). Once complete, indexer notifies searchd by sending the SIGHUP signal, and searchd will attempt to rename the tables (renaming the existing ones to include .old and renaming the .new ones to replace them), and then will start serving from the newer files. Depending on the setting of seamless_rotate, there may be a slight delay in being able to search the newer tables. In case multiple tables chained by killlist_target relations are rotated at once, rotation will start with the tables that are not targets and finish with the ones at the end of the target chain. Example usage:
sudo -u manticore indexer --rotate --all
--quiet tells indexer not to output anything, unless there is an error. This is mostly used for cron-type or other scripted jobs where the output is irrelevant or unnecessary, except in the event of some kind of error. Example usage:
sudo -u manticore indexer --rotate --all --quiet
--noprogress does not display progress details as they occur. Instead, the final status details (such as documents indexed, speed of indexing, and so on) are only reported at the completion of indexing. In instances where the script is not being run on a console (or 'tty'), this will be on by default. Example usage:
sudo -u manticore indexer --rotate --all --noprogress
--buildstops <outputfile.txt> <N> reviews the table source, as if it were indexing the data, and produces a list of the terms that are being indexed. In other words, it produces a list of all the searchable terms that are becoming part of the table. Note that it does not update the table in question; it simply processes the data as if it were indexing, including running queries defined with sql_query_pre or sql_query_post. outputfile.txt will contain the list of words, one per line, sorted by frequency with the most frequent first, and N specifies the maximum number of words that will be listed. If it's sufficiently large to encompass every word in the table, only that many words will be returned. Such a dictionary list could be used for client application features around "Did you mean…" functionality, usually in conjunction with --buildfreqs, below. Example:
sudo -u manticore indexer mytable --buildstops word_freq.txt 1000
This would produce a document in the current directory, word_freq.txt, with the 1,000 most common words in 'mytable', ordered by most common first. Note that the file will pertain to the last table indexed when specified with multiple tables or --all (i.e. the last one listed in the configuration file).
--buildfreqs works with --buildstops (and is ignored if --buildstops is not specified). As --buildstops provides the list of words used within the table, --buildfreqs adds the quantity present in the table, which would be useful in establishing whether certain words should be considered stopwords if they are too prevalent. It will also help with developing "Did you mean…" features where you need to know how much more common a given word is compared to another, similar one. For example:
sudo -u manticore indexer mytable --buildstops word_freq.txt 1000 --buildfreqs
This would produce the word_freq.txt as above, however after each word would be the number of times it occurred in the table in question.
--merge <dst-table> <src-table> is used for physically merging tables together, for example, if you have a main+delta scheme, where the main table rarely changes but the delta table is rebuilt frequently, and --merge would be used to combine the two. The operation moves from right to left: the contents of src-table get examined and physically combined with the contents of dst-table, and the result is left in dst-table. In pseudo-code, it might be expressed as: dst-table += src-table. An example:
sudo -u manticore indexer --merge main delta --rotate
In the above example, where the main is the master, rarely modified table, and the delta is more frequently modified one, you might use the above to call indexer to combine the contents of the delta into the main table and rotate the tables.
--merge-dst-range <attr> <min> <max> applies the given range filter while merging. Specifically, as the merge is applied to the destination table (as part of --merge, and is ignored if --merge is not specified), indexer will also filter the documents ending up in the destination table, and only documents that pass the given filter will end up in the final table. This could be used, for example, in a table where there is a 'deleted' attribute, where 0 means 'not deleted'. Such a table could be merged with:
sudo -u manticore indexer --merge main delta --merge-dst-range deleted 0 0
Any documents marked as deleted (value 1) will be removed from the newly-merged destination table. It can be added several times to the command line, to add successive filters to the merge, all of which must be met in order for a document to become part of the final table.
--merge-killlists (and its shorter alias --merge-klists) changes the way kill lists are processed when merging tables. By default, both kill lists get discarded after a merge. That supports the most typical main+delta merge scenario. With this option enabled, however, kill lists from both tables get concatenated and stored into the destination table. Note that a source (delta) table kill list will be used to suppress rows from a destination (main) table at all times.

--keep-attrs allows reusing existing attributes on reindexing. Whenever the table is rebuilt, each new document id is checked for presence in the "old" table, and if it already exists, its attributes are transferred to the "new" table; if not found, attributes from the new table are used. If the user has updated attributes in the table, but not in the actual source used for the table, all updates will be lost when reindexing; using --keep-attrs enables saving the updated attribute values from the previous table. It is possible to specify a path for table files to be used instead of the reference path from the config:
sudo -u manticore indexer mytable --keep-attrs=/path/to/index/files
--keep-attrs-names=<attributes list> allows you to specify attributes to reuse from an existing table on reindexing. By default, all attributes from the existing table are reused in the new table:
sudo -u manticore indexer mytable --keep-attrs=/path/to/table/files --keep-attrs-names=update,state
--dump-rows <FILE> dumps rows fetched by SQL source(s) into the specified file, in a MySQL compatible syntax. The resulting dumps are the exact representation of data as received by indexer and can help repeat indexing-time issues. The command performs fetching from the source and creates both table files and the dump file.

--print-rt <rt_index> <table> outputs fetched data from the source as INSERTs for a real-time table. The first lines of the dump will contain the real-time fields and attributes (as a reflection of the plain table fields and attributes). The command performs fetching from the source and creates both table files and the dump output. The command can be used as sudo -u manticore indexer -c manticore.conf --print-rt indexrt indexplain > dump.sql. Only SQL-based sources are supported. MVAs are not supported.

--sighup-each is useful when you are rebuilding many big tables and want each one rotated into searchd as soon as possible. With --sighup-each, indexer will send the SIGHUP signal to searchd after successfully completing work on each table. (The default behavior is to send a single SIGHUP after all the tables are built.)

--nohup is useful when you want to check your table with indextool before actually rotating it. indexer won't send the SIGHUP if this option is on. Table files are renamed to .tmp. Use indextool to rename table files to .new and rotate it. Example usage:
sudo -u manticore indexer --rotate --nohup mytable
sudo -u manticore indextool --rotate --check mytable
--print-queries prints out SQL queries that indexer sends to the database, along with SQL connection and disconnection events. That is useful to diagnose and fix problems with SQL sources.

--help (-h for short) lists all the parameters that can be called in indexer.

-v shows indexer version.

You can also configure indexer behavior in the Manticore configuration file in the indexer section:
indexer {
...
}
lemmatizer_cache = 256M
Lemmatizer cache size. Optional, default is 256K.
Our lemmatizer implementation uses a compressed dictionary format that enables a space/speed tradeoff. It can either perform lemmatization off the compressed data, using more CPU but less RAM, or it can decompress and precache the dictionary either partially or fully, thus using less CPU but more RAM. The lemmatizer_cache directive lets you control how much RAM exactly can be spent for that uncompressed dictionary cache.
Currently, the only available dictionaries are ru.pak, en.pak, and de.pak. These are the Russian, English, and German dictionaries. The compressed dictionary is approximately 2 to 10 MB in size. Note that the dictionary stays in memory at all times too. The default cache size is 256 KB. The accepted cache sizes are 0 to 2047 MB. It's safe to raise the cache size too high; the lemmatizer will only use the needed memory. For example, the entire Russian dictionary decompresses to approximately 110 MB; thus setting lemmatizer_cache higher than that will not affect the memory use. Even when 1024 MB is allowed for the cache, if only 110 MB is needed, it will only use those 110 MB.
max_file_field_buffer = 128M
Maximum file field adaptive buffer size in bytes. Optional, default is 8MB, minimum is 1MB.
The file field buffer is used to load files referred to from sql_file_field columns. This buffer is adaptive, starting at 1 MB at first allocation, and growing in 2x steps until either the file contents can be loaded or the maximum buffer size, specified by the max_file_field_buffer directive, is reached.
Thus, if no file fields are specified, no buffer is allocated at all. If all files loaded during indexing are under (for example) 2 MB in size, but the max_file_field_buffer value is 128 MB, the peak buffer usage would still be only 2 MB. However, files over 128 MB would be entirely skipped.
max_iops = 40
Maximum I/O operations per second, for I/O throttling. Optional, default is 0 (unlimited).
I/O throttling related option. It limits the maximum count of I/O operations (reads or writes) per any given second. A value of 0 means that no limit is imposed.
indexer can cause bursts of intensive disk I/O during building a table, and it might be desirable to limit its disk activity (and reserve something for other programs running on the same machine, such as searchd). I/O throttling helps to do that. It works by enforcing a minimum guaranteed delay between subsequent disk I/O operations performed by indexer. Throttling I/O can help reduce search performance degradation caused by building. This setting is not effective for other kinds of data ingestion, e.g. inserting data into a real-time table.
max_iosize = 1048576
Maximum allowed I/O operation size, in bytes, for I/O throttling. Optional, default is 0 (unlimited).
I/O throttling related option. It limits the maximum file I/O operation (read or write) size for all operations performed by indexer. A value of 0 means that no limit is imposed. Reads or writes that are bigger than the limit will be split into several smaller operations, and counted as several operations by the max_iops setting. At the time of this writing, all I/O calls should be under 256 KB (default internal buffer size) anyway, so max_iosize values higher than 256 KB should not have any effect.
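As an illustration, here is a hypothetical indexer section combining both throttling directives described above (the values are arbitrary):

indexer {
    max_iops = 40
    max_iosize = 1048576
}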
max_xmlpipe2_field = 8M
Maximum allowed field size for XMLpipe2 source type, in bytes. Optional, default is 2 MB.
mem_limit = 256M
# mem_limit = 262144K # same, but in KB
# mem_limit = 268435456 # same, but in bytes
Plain table building RAM usage limit. Optional, default is 128 MB. Enforced memory usage limit that the indexer will not go above. Can be specified in bytes, or kilobytes (using K postfix), or megabytes (using M postfix); see the example. This limit will be automatically raised if set to an extremely low value causing I/O buffers to be less than 8 KB; the exact lower bound for that depends on the built data size. If the buffers are less than 256 KB, a warning will be produced.
The maximum possible limit is 2047M. Too low values can hurt plain table building speed, but 256M to 1024M should be enough for most, if not all datasets. Setting this value too high can cause SQL server timeouts. During the document collection phase, there will be periods when the memory buffer is partially sorted and no communication with the database is performed; and the database server can timeout. You can resolve that either by raising timeouts on the SQL server side or by lowering mem_limit.
on_file_field_error = skip_document
How to handle IO errors in file fields. Optional, default is ignore_field.
When there is a problem indexing a file referenced by a file field (sql_file_field), indexer can either process the document, assuming empty content in this particular field, or skip the document, or fail indexing entirely. on_file_field_error directive controls that behavior. The values it takes are:
- ignore_field: process the current document without the field;
- skip_document: skip the current document but continue indexing;
- fail_index: fail indexing with an error message.

The problems that can arise are: open error, size error (file too big), and data read error. Warning messages on any problem will be given at all times, regardless of the phase and the on_file_field_error setting.
Note that with on_file_field_error = skip_document documents will only be ignored if problems are detected during an early check phase, and not during the actual file parsing phase. indexer will open every referenced file and check its size before doing any work, and then open it again when doing actual parsing work. So in case a file goes away between these two open attempts, the document will still be indexed.
write_buffer = 4M
Write buffer size, bytes. Optional, default is 1MB. Write buffers are used to write both temporary and final table files when indexing. Larger buffers reduce the number of required disk writes. Memory for the buffers is allocated in addition to mem_limit. Note that several (currently up to 4) buffers for different files will be allocated, proportionally increasing the RAM usage.
ignore_non_plain = 1
ignore_non_plain allows you to completely ignore warnings about skipping non-plain tables. The default is 0 (not ignoring).
There are two approaches to scheduling indexer runs. The first way is the classical method of using crontab. The second way is using a systemd timer with a user-defined schedule. To create the timer unit files, you should place them in the appropriate directory where systemd looks for such unit files. On most Linux distributions, this directory is typically /etc/systemd/system. Here's how to do it:
cat << EOF > /etc/systemd/system/manticore-indexer@.timer
[Unit]
Description=Run Manticore Search's indexer on schedule
[Timer]
OnCalendar=minutely
RandomizedDelaySec=5m
Unit=manticore-indexer@%i.service
[Install]
WantedBy=timers.target
EOF
More on the OnCalendar syntax and examples can be found here.
systemctl enable manticore-indexer@idx1.timer
systemctl start manticore-indexer@idx1.timer
Manticore Search allows fetching data from databases using specialized drivers or ODBC. Current drivers include:
- mysql - for MySQL/MariaDB/Percona MySQL databases
- pgsql - for PostgreSQL databases
- mssql - for Microsoft SQL databases
- odbc - for any database that accepts connections using ODBC

To fetch data from the database, a source must be configured with type as one of the above. The source requires information about how to connect to the database and the query that will be used to fetch the data. Additional pre- and post-queries can also be set - either to configure session settings or to perform pre/post fetch tasks. The source also must contain definitions of data types for the columns that are fetched.
The source definition must contain the connection settings: the host, port, user credentials, and any driver-specific settings.
The database server host to connect to. Note that the MySQL client library chooses whether to connect over TCP/IP or over UNIX socket based on the host name. Specifically, "localhost" will force it to use UNIX socket (this is the default and generally recommended mode) and "127.0.0.1" will force TCP/IP usage.
The server IP port to connect to.
For mysql the default is 3306 and for pgsql, it is 5432.
The SQL database to use after the connection is established, and within which further queries will be performed.
The username used for connecting.
The user password to use when connecting. If the password includes # (which can be used to add comments in the configuration file), you can escape it with \.
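For instance, a hypothetical password containing # could be escaped like this:

sql_pass = mysecret\#password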
UNIX socket name to connect to for local database servers. Note that it depends on the sql_host setting whether this value will actually be used.
sql_sock = /var/lib/mysql/mysql.sock
MySQL client connection flags. Optional, the default value is 0 (do not set any flags).
This option must contain an integer value with the sum of the flags. The value will be passed to mysql_real_connect() verbatim. The flags are enumerated in mysql_com.h include file. Flags that are especially interesting in regard to indexing, with their respective values, are as follows:
mysql_connect_flags = 32 # enable compression
- mysql_ssl_cert - path to SSL certificate
- mysql_ssl_key - path to SSL key file
- mysql_ssl_ca - path to CA certificate

unpack_mysqlcompress_maxsize = 1M
Columns to unpack using MySQL UNCOMPRESS() algorithm. Multi-value, optional, default value is an empty list of columns.
Columns specified using this directive will be unpacked by the indexer using the modified zlib algorithm used by MySQL COMPRESS() and UNCOMPRESS() functions. When indexing on a different box than the database, this lets you offload the database and save on network traffic. This feature is only available if zlib and zlib-devel were both available during build time.
unpack_mysqlcompress = body_compressed
unpack_mysqlcompress = description_compressed
By default, a buffer of 16M is used for uncompressing the data. This can be changed by setting unpack_mysqlcompress_maxsize.
When using unpack_mysqlcompress, due to implementation intricacies, it is not possible to deduce the required buffer size from the compressed data. So, the buffer must be preallocated in advance, and the unpacked data can not go over the buffer size.
unpack_zlib = col1
unpack_zlib = col2
Columns to unpack using zlib (aka deflate, aka gunzip). Multi-value, optional, default value is an empty list of columns. Applies to source types mysql and pgsql only.
Columns specified using this directive will be unpacked by the indexer using the standard zlib algorithm (called deflate and also implemented by gunzip). When indexing on a different box than the database, this lets you offload the database and save on network traffic. This feature is only available if zlib and zlib-devel were both available during build time.
MS SQL Windows authentication flag. Whether to use currently logged-in Windows account credentials for authentication when connecting to MS SQL Server.
mssql_winauth = 1
Sources using ODBC require the presence of a DSN (Data Source Name) string which can be set with odbc_dsn.
odbc_dsn = Driver={Oracle ODBC Driver};Dbq=myDBName;Uid=myUsername;Pwd=myPassword
Please note that the format depends on the specific ODBC driver used.
With all the SQL drivers, building a plain table generally works as follows.
- sql_query_pre_all queries are executed to perform any necessary initial setup, such as setting per-connection encoding with MySQL. These queries run before the entire indexing process, and also after a reconnect for indexing MVA attributes and joined fields.
- sql_query_pre pre-queries are executed to perform any necessary initial setup, such as setting up temporary tables or maintaining counter tables. These queries run once for the entire indexing process.
- The main query sql_query is executed, and the rows it returns are processed.
- sql_query_post is executed to perform some necessary cleanup.
- sql_query_post_index is executed to perform some necessary final cleanup.

Example of a source fetching data from MySQL:
source mysource {
type = mysql
path = /path/to/realtime
sql_host = localhost
sql_user = myuser
sql_pass = mypass
sql_db = mydb
sql_query_pre = SET CHARACTER_SET_RESULTS=utf8
sql_query_pre = SET NAMES utf8
sql_query = SELECT id, title, description, category_id FROM mytable
sql_query_post = DROP TABLE view_table
sql_query_post_index = REPLACE INTO counters ( id, val ) \
VALUES ( 'max_indexed_id', $maxid )
sql_attr_uint = category_id
sql_field_string = title
}
table mytable {
type = plain
source = mysource
path = /path/to/mytable
...
}
This is the query used to retrieve documents from a SQL server. There can be only one sql_query declared, and it's mandatory to have one. See also Processing fetched data.
Pre-fetch query or pre-query. This is a multi-value, optional setting, with the default being an empty list of queries. The pre-queries are executed before the sql_query in the order they appear in the configuration file. The results of the pre-queries are ignored.
Pre-queries are useful in many ways. They can be used to set up encoding, mark records that are going to be indexed, update internal counters, set various per-connection SQL server options and variables, and so on.
Perhaps the most frequent use of pre-query is to specify the encoding that the server will use for the rows it returns. Note that Manticore accepts only UTF-8 text. Two MySQL specific examples of setting the encoding are:
sql_query_pre = SET CHARACTER_SET_RESULTS=utf8
sql_query_pre = SET NAMES utf8
Also, specific to MySQL sources, it is useful to disable query cache (for indexer connection only) in pre-query, because indexing queries are not going to be re-run frequently anyway, and there's no sense in caching their results.
That could be achieved with:
sql_query_pre = SET SESSION query_cache_type=OFF
Post-fetch query. This is an optional setting, with the default value being empty.
This query is executed immediately after sql_query completes successfully. When the post-fetch query produces errors, they are reported as warnings, but indexing is not terminated. Its result set is ignored. Note that indexing is not yet completed at the point when this query gets executed, and further indexing may still fail. Therefore, any permanent updates should not be done from here. For instance, updates on a helper table that permanently change the last successfully indexed ID should not be run from the sql_query_post query; they should be run from the sql_query_post_index query instead.
Post-processing query. This is an optional setting, with the default value being empty.
This query is executed when indexing is fully and successfully completed. If this query produces errors, they are reported as warnings, but indexing is not terminated. Its result set is ignored. The $maxid macro can be used in its text; it will be expanded to the maximum document ID that was actually fetched from the database during indexing. If no documents were indexed, $maxid will be expanded to 0.
Example:
sql_query_post_index = REPLACE INTO counters ( id, val ) \
VALUES ( 'max_indexed_id', $maxid )
The difference between sql_query_post and sql_query_post_index is that sql_query_post is run immediately when Manticore receives all the documents, but further indexing may still fail for some other reason. On the contrary, by the time the sql_query_post_index query gets executed, it is guaranteed that the table was created successfully. The database connection is dropped and re-established because the sorting phase can be very lengthy and would otherwise time out.
By default, the first column from the result set of sql_query is indexed as the document id.
Document ID MUST be the very first field, and it MUST BE UNIQUE SIGNED (NON-ZERO) INTEGER NUMBER from -9223372036854775808 to 9223372036854775807.
You can specify up to 256 full-text fields and an arbitrary amount of attributes. All the columns that are neither document ID (the first one) nor attributes will be indexed as full-text fields.
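For illustration, a hypothetical fetch query where the first column becomes the document ID, title and content become full-text fields, and category_id is declared as an attribute (table and column names are made up):

sql_query     = SELECT id, title, content, category_id FROM documents
sql_attr_uint = category_id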
Declares a 64-bit signed integer.
Declares a boolean attribute. It's equivalent to an integer attribute with bit count of 1.
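A hypothetical declaration (the column name is illustrative):

sql_attr_bool = is_deleted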
Declares a floating point attribute.
The values will be stored in single precision, 32-bit IEEE 754 format. Represented range is approximately from 1e-38 to 1e+38. The amount of decimal digits that can be stored precisely is approximately 7.
One important usage of float attributes is storing latitude and longitude values (in radians), for further usage in query-time geosphere distance calculations.
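For example, hypothetical latitude/longitude columns (already converted to radians in the fetch query) could be declared as:

sql_attr_float = lat_radians
sql_attr_float = long_radians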
Declares a JSON attribute.
When indexing JSON attributes, Manticore expects a text field with JSON formatted data. JSON attributes support arbitrary JSON data with no limitation in nested levels or types.
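A hypothetical declaration, assuming the fetch query returns a column containing JSON text:

sql_attr_json = properties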
Declares a multi-value attribute.
Plain attributes only allow attaching 1 value per each document. However, there are cases (such as tags or categories) when it is desired to attach multiple values of the same attribute and be able to apply filtering or grouping to value lists.
The MVA can take the values from a column (like the rest of the data types) - in this case, the column in the result set must provide a string with multiple integer values separated by commas - or by running a separate query to get the values.
When executing a query, the engine runs the query, groups the results by IDs, and assigns the values to their corresponding documents in the table. Values with an ID not found in the table are discarded. Before executing the query, any defined sql_query_pre_all will be run.
The declaration format for sql_attr_multi is as follows:
sql_attr_multi = ATTR-TYPE ATTR-NAME 'from' SOURCE-TYPE \
[;QUERY] \
[;RANGED-QUERY]
where
- ATTR-TYPE is one of uint, bigint, or timestamp.
- SOURCE-TYPE is one of field, query, ranged-query, or ranged-main-query.
- RANGED-QUERY is an optional SQL query similar to sql_query_range. It's used with the ranged-query SOURCE-TYPE. If using the ranged-main-query SOURCE-TYPE, then omit the RANGED-QUERY, and it will automatically use the same query from sql_query_range (a useful option in complex inheritance setups to save having to manually duplicate the same query many times).
sql_attr_multi = uint tag from field
sql_attr_multi = uint tag from query; SELECT id, tag FROM tags
sql_attr_multi = bigint tag from ranged-query; \
SELECT id, tag FROM tags WHERE id>=$start AND id<=$end; \
SELECT MIN(id), MAX(id) FROM tags
Declares a string attribute. The maximum size of each value is fixed at 4GB.
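As with the other attribute types, a hypothetical declaration could be:

sql_attr_string = author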
Declares a UNIX timestamp.
Timestamps can store dates and times in the range of January 01, 1970, to January 19, 2038, with a precision of one second. The expected column value should be a timestamp in UNIX format, which is a 32-bit unsigned integer number of seconds elapsed since midnight on January 01, 1970, GMT. Timestamps are internally stored and handled as integers everywhere. In addition to working with timestamps as integers, you can also use them with different date-based functions, such as time segments sorting mode or day/week/month/year extraction for GROUP BY.
Note that DATE or DATETIME column types in MySQL cannot be directly used as timestamp attributes in Manticore; you need to explicitly convert such columns using UNIX_TIMESTAMP function (if the data is in range).
Note timestamps can not represent dates before January 01, 1970, and UNIX_TIMESTAMP() in MySQL will not return anything expected. If you only need to work with dates, not times, consider TO_DAYS() function in MySQL instead.
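For example, a hypothetical DATETIME column could be converted in the fetch query and declared as a timestamp attribute (names are made up):

sql_query          = SELECT id, title, UNIX_TIMESTAMP(added_at) AS added_ts FROM documents
sql_attr_timestamp = added_ts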
Declares an unsigned integer attribute.
You can specify the bit count for integer attributes by appending ':BITCOUNT' to attribute name (see example below). Attributes with less than default 32-bit size, or bitfields, perform slower.
sql_attr_uint = group_id
sql_attr_uint = forum_id:9 # 9 bits for forum_id
Declares a combo string attribute/text field. The values will be indexed as a full-text field, but also stored in a string attribute with the same name. Note, it should only be used when you are sure you want the field to be searchable both in a full-text manner and as an attribute (with the ability to sort and group by it). If you just want to be able to fetch the original value of the field, you don't need to do anything for it unless you implicitly removed the field from the stored fields list via stored_fields.
sql_field_string = name
Declares a file based field.
This directive makes indexer interpret field contents as a file name, and load and process the referred file. Files larger than max_file_field_buffer in size are skipped. Any errors during the file loading (IO errors, missed limits, etc.) will be reported as indexing warnings and will not early terminate the indexing. No content will be indexed for such files.
sql_file_field = field_name
Joined/payload field fetch query. Multi-value, optional, the default is an empty list of queries.
sql_joined_field lets you use two different features: joined fields and payloads (payload fields). Its syntax is as follows:
sql_joined_field = FIELD-NAME 'from' ( 'query' | 'payload-query' | 'ranged-query' | 'ranged-main-query' ); \
QUERY [ ; RANGE-QUERY ]
where
Joined fields let you avoid JOIN and/or GROUP_CONCAT statements in the main document fetch query (sql_query). This can be useful when the SQL-side JOIN is slow, or needs to be offloaded on the Manticore side, or simply to emulate MySQL-specific GROUP_CONCAT functionality in case your database server does not support it.
The query must return exactly 2 columns: document ID, and text to append to a joined field. Document IDs can be duplicate, but they must be in ascending order. All the text rows fetched for a given ID will be concatenated together, and the concatenation result will be indexed as the entire contents of a joined field. Rows will be concatenated in the order returned from the query, and separating whitespace will be inserted between them. For instance, if the joined field query returns the following rows:
( 1, 'red' )
( 1, 'right' )
( 1, 'hand' )
( 2, 'mysql' )
( 2, 'manticore' )
then the indexing results would be equivalent to adding a new text field with a value of 'red right hand' to document 1 and 'mysql manticore' to document 2, including the keyword positions inside the field in the order they come from the query. If the rows need to be in a specific order, that needs to be explicitly defined in the query.
Joined fields are only indexed differently. There are no other differences between joined fields and regular text fields.
Before executing the joined fields query, any set of sql_query_pre_all will be run, if any exist. This allows you to set the desired encoding, etc., within the joined fields' context.
When a single query is not efficient enough or does not work because of the database driver limitations, ranged queries can be used. It works similarly to the ranged queries in the main indexing loop. The range will be queried for and fetched upfront once, then multiple queries with different $start and $end substitutions will be run to fetch the actual data.
When using ranged-main-query query, omit the ranged-query, and it will automatically use the same query from sql_query_range (a useful option in complex inheritance setups to save having to manually duplicate the same query many times).
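As an illustration, a ranged joined-field query might look like the following sketch, mirroring the ranged example shown for sql_attr_multi above (table and column names are hypothetical):

sql_joined_field = comments from ranged-query; \
    SELECT doc_id, comment_text FROM comments WHERE doc_id>=$start AND doc_id<=$end; \
    SELECT MIN(doc_id), MAX(doc_id) FROM comments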
Payloads let you create a special field in which, instead of keyword positions, so-called user payloads are stored. Payloads are custom integer values attached to every keyword. They can then be used at search time to affect the ranking.
The payload query must return exactly 3 columns:
- document ID
- keyword
- and integer payload value.
Document IDs can be duplicate, but they must be in ascending order. Payloads must be unsigned integers within the 24-bit range, i.e., from 0 to 16777215.
The only ranker that accounts for payloads is proximity_bm25 (the default ranker). On tables with payload fields, it will automatically switch to a variant that matches keywords in those fields, computes a sum of matched payloads multiplied by field weights, and adds that sum to the final rank.
Please note that the payload field is ignored for full-text queries containing complex operators. It only works for simple bag-of-words queries.
Configuration example:
source min {
type = mysql
sql_host = localhost
sql_user = test
sql_pass =
sql_db = test
sql_query = select 1, 'Nike bag' f \
UNION select 2, 'Adidas bag' f \
UNION select 3, 'Reebok bag' f \
UNION select 4, 'Nike belt' f
sql_joined_field = tag from payload-query; select 1 id, 'nike' tag, 10 weight \
UNION select 4 id, 'nike' tag, 10 weight;
}
index idx {
path = idx
source = min
}
mysql> select * from idx;
+------+------------+------+
| id | f | tag |
+------+------------+------+
| 1 | Nike bag | nike |
| 2 | Adidas bag | |
| 3 | Reebok bag | |
| 4 | Nike belt | nike |
+------+------------+------+
4 rows in set (0.00 sec)
mysql> select *, weight() from idx where match('nike|adidas');
+------+------------+------+----------+
| id | f | tag | weight() |
+------+------------+------+----------+
| 1 | Nike bag | nike | 11539 |
| 4 | Nike belt | nike | 11539 |
| 2 | Adidas bag | | 1597 |
+------+------------+------+----------+
3 rows in set (0.01 sec)
mysql> select *, weight() from idx where match('"nike bag"|"adidas bag"');
+------+------------+------+----------+
| id | f | tag | weight() |
+------+------------+------+----------+
| 2 | Adidas bag | | 2565 |
| 1 | Nike bag | nike | 2507 |
+------+------------+------+----------+
2 rows in set (0.00 sec)
sql_column_buffers = <colname>=<size>[K|M] [, ...]
Per-column buffer sizes. Optional, default is empty (deduce the sizes automatically). Applies to odbc, mssql source types only.
ODBC and MS SQL drivers sometimes cannot return the maximum actual column size to be expected. For instance, NVARCHAR(MAX) columns always report their length as 2147483647 bytes to indexer even though the actually used length is likely considerably less. However, the receiving buffers still need to be allocated upfront, and their sizes have to be determined. When the driver does not report the column length at all, Manticore allocates default 1 KB buffers for each non-char column, and 1 MB buffers for each char column. Driver-reported column length also gets clamped by an upper limit of 8 MB, so in case the driver reports (almost) a 2 GB column length, it will be clamped and an 8 MB buffer will be allocated instead for that column. These hard-coded limits can be overridden using the sql_column_buffers directive, either in order to save memory on actually shorter columns or to overcome the 8 MB limit on actually longer columns. The directive values must be a comma-separated list of selected column names and sizes:
Example:
sql_query = SELECT id, mytitle, mycontent FROM documents
sql_column_buffers = mytitle=64K, mycontent=10M
Main query, which needs to fetch all the documents, can impose a read lock on the whole table and stall the concurrent queries (e.g. INSERTs to MyISAM table), waste a lot of memory for result set, etc. To avoid this, Manticore supports so-called ranged queries. With ranged queries, Manticore first fetches min and max document IDs from the table, and then substitutes different ID intervals into main query text and runs the modified query to fetch another chunk of documents. Here's an example.
Ranged query usage example:
sql_query_range = SELECT MIN(id),MAX(id) FROM documents
sql_range_step = 1000
sql_query = SELECT * FROM documents WHERE id>=$start AND id<=$end
If the table contains document IDs from 1 to, say, 2345, then sql_query would be run three times:
- $start replaced with 1 and $end replaced with 1000;
- $start replaced with 1001 and $end replaced with 2000;
- $start replaced with 2001 and $end replaced with 2345.
Obviously, that's not much of a difference for a 2,000-row table, but when it comes to indexing a 10-million-row table, ranged queries might be of some help.
Defines the range query. The query specified in this option must fetch min and max document IDs that will be used as range boundaries. It must return exactly two integer fields, min ID first and max ID second; the field names are ignored. When enabled, sql_query will be required to contain $start and $end macros. Note that the intervals specified by $start..$end will not overlap, so you should not remove document IDs that are exactly equal to $start or $end from your query.
This directive defines the range query step. The default value is 1024.
This directive can be used to throttle the ranged query. By default, there is no throttling. Values for sql_ranged_throttle should be specified in milliseconds.
Throttling can be useful when the indexer imposes too much load on the database server. It causes the indexer to sleep for a given amount of time once per each ranged query step. This sleep is unconditional and is performed before the fetch query.
sql_ranged_throttle = 1000 # sleep for 1 sec before each query step
The xmlpipe2 source type allows for passing custom full-text and attribute data to Manticore in a custom XML format, with the schema (i.e., set of fields and attributes) specified in either the XML stream itself or in the source settings.
To declare the XML stream, the xmlpipe_command directive is mandatory and contains the shell command that produces the XML stream to be indexed. This can be a file, but it can also be a program that generates XML content on-the-fly.
When indexing an xmlpipe2 source, the indexer runs the specified command, opens a pipe to its stdout, and expects a well-formed XML stream.
Here's an example of what the XML stream data might look like:
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
<sphinx:schema>
<sphinx:field name="subject"/>
<sphinx:field name="content"/>
<sphinx:attr name="published" type="timestamp"/>
<sphinx:attr name="author_id" type="int" bits="16" default="1"/>
</sphinx:schema>
<sphinx:document id="1234">
<content>this is the main content <![CDATA[and this <cdata> entry
must be handled properly by xml parser lib]]></content>
<published>1012325463</published>
<subject>note how field/attr tags can be
in <strong class="red">randomized</strong> order</subject>
<misc>some undeclared element</misc>
</sphinx:document>
<sphinx:document id="1235">
<subject>another subject</subject>
<content>here comes another document, and i am given to understand,
that in-document field order must not matter, sir</content>
<published>1012325467</published>
</sphinx:document>
<!-- ... even more sphinx:document entries here ... -->
<sphinx:killlist>
<id>1234</id>
<id>4567</id>
</sphinx:killlist>
</sphinx:docset>
Arbitrary fields and attributes are allowed. They can also occur in the stream in arbitrary order within each document; the order is ignored. There is a restriction on the maximum field length; fields longer than 2 MB will be truncated to 2 MB (this limit can be changed in the source).
The schema, i.e., the complete list of fields and attributes, must be declared before any document can be parsed. This can be done either in the configuration file, using the xmlpipe_field and xmlpipe_attr_XXX settings, or right in the stream, using the <sphinx:schema> element. <sphinx:schema> is optional. It is only allowed to occur as the very first sub-element in <sphinx:docset>. If there is no in-stream schema definition, settings from the configuration file will be used; otherwise, stream settings take precedence. Note that the document ID should be specified as the id attribute of the <sphinx:document> tag (e.g. <sphinx:document id="1235">) and is supposed to be a unique signed positive non-zero 64-bit integer.
Unknown tags (those declared neither as fields nor as attributes) will be ignored with a warning. In the example above, <misc> will be ignored. All embedded tags and their attributes (such as <strong> in <subject> in the example above) will be silently ignored.
Support for incoming stream encodings depends on whether iconv is installed on the system. xmlpipe2 is parsed using the libexpat parser, which understands US-ASCII, ISO-8859-1, UTF-8, and a few UTF-16 variants natively. Manticore's configure script will also check for libiconv presence and utilize it to handle other encodings. libexpat also enforces the requirement to use the UTF-8 charset on the Manticore side because the parsed data it returns is always in UTF-8.
XML elements (tags) recognized by xmlpipe2 (and their attributes where applicable) are:
- sphinx:docset - Mandatory top-level element; denotes and contains the xmlpipe2 document set.
- sphinx:schema - Optional element; must either occur as the very first child of sphinx:docset or not occur at all. Declares the document schema and contains field and attribute declarations. If present, it overrides per-source settings from the configuration file.
- sphinx:field - Optional element, child of sphinx:schema. Declares a full-text field. Known attributes are:
- sphinx:attr - Optional element, child of sphinx:schema. Declares an attribute. Known attributes are:
- sphinx:document - Mandatory element, must be a child of sphinx:docset. Contains arbitrary other elements with field and attribute values to be indexed, as declared either using sphinx:field and sphinx:attr elements or in the configuration file. The only known attribute is "id", which must contain the unique integer document ID.
- sphinx:killlist - Optional element, child of sphinx:docset. Contains a number of "id" elements whose contents are document IDs to be put into a kill-list of the table. The kill-list is used in multi-table searches to suppress documents found in other tables of the search.
If the XML doesn't define a schema, the data types of the table elements must be defined in the source configuration.
- xmlpipe_field - declares a text field.
- xmlpipe_field_string - declares a text field/string attribute. The column will be both indexed as a text field and stored as a string attribute.
- xmlpipe_attr_uint - declares an integer attribute.
- xmlpipe_attr_timestamp - declares a timestamp attribute.
- xmlpipe_attr_bool - declares a boolean attribute.
- xmlpipe_attr_float - declares a float attribute.
- xmlpipe_attr_bigint - declares a big integer attribute.
- xmlpipe_attr_multi - declares a multi-value attribute with integers.
- xmlpipe_attr_multi_64 - declares a multi-value attribute with 64-bit integers.
- xmlpipe_attr_string - declares a string attribute.
- xmlpipe_attr_json - declares a JSON attribute.
If xmlpipe_fixup_utf8 is set, it enables Manticore-side UTF-8 validation and filtering to prevent the XML parser from choking on non-UTF-8 documents. By default, this option is disabled.
In certain cases, it might be hard or even impossible to guarantee that the incoming XMLpipe2 document bodies are in perfectly valid and conforming UTF-8 encoding. For instance, documents with national single-byte encodings could sneak into the stream. The libexpat XML parser is fragile, meaning that it will stop processing in such cases. The UTF-8 fixup feature lets you avoid that. When fixup is enabled, Manticore will preprocess the incoming stream before passing it to the XML parser and replace invalid UTF-8 sequences with spaces.
xmlpipe_fixup_utf8 = 1
Example of XML source without schema in configuration:
source xml_test_1
{
type = xmlpipe2
xmlpipe_command = cat /tmp/products_today.xml
}
Example of XML source with schema in configuration:
source xml_test_2
{
type = xmlpipe2
xmlpipe_command = cat /tmp/products_today.xml
xmlpipe_field = subject
xmlpipe_field = content
xmlpipe_attr_timestamp = published
xmlpipe_attr_uint = author_id:16
}
TSV/CSV is the simplest way to pass data to the Manticore indexer. This method was created due to the limitations of xmlpipe2. In xmlpipe2, the indexer must map each attribute and field tag in the XML file to a corresponding schema element. This mapping requires time, and it increases with the number of fields and attributes in the schema. TSV/CSV has no such issue, as each field and attribute corresponds to a particular column in the TSV/CSV file. In some cases, TSV/CSV could work slightly faster than xmlpipe2.
The first column in a TSV/CSV file must be the document ID. The rest of the columns must mirror the declaration of fields and attributes in the schema definition. Note that you don't need to declare the document ID in the schema, since it's always considered to be present; it should be in the 1st column and needs to be a unique signed positive non-zero 64-bit integer.
The difference between tsvpipe and csvpipe is in the delimiter and quoting rules. tsvpipe uses the tab character as a hardcoded delimiter and has no quoting rules. csvpipe has the csvpipe_delimiter option for the delimiter, with a comma as the default value, and also applies quoting rules.
The tsvpipe_command directive is mandatory and contains the shell command invoked to produce the TSV stream that gets indexed. The command can read a TSV file, but it can also be a program that generates the tab-delimited content on the fly.
The following directives can be used to declare the types of the indexed columns:
- tsvpipe_field - declares a text field.
- tsvpipe_field_string - declares a text field/string attribute. The column will be both indexed as a text field and stored as a string attribute.
- tsvpipe_attr_uint - declares an integer attribute.
- tsvpipe_attr_timestamp - declares a timestamp attribute.
- tsvpipe_attr_bool - declares a boolean attribute.
- tsvpipe_attr_float - declares a float attribute.
- tsvpipe_attr_bigint - declares a big integer attribute.
- tsvpipe_attr_multi - declares a multi-value attribute with integers.
- tsvpipe_attr_multi_64 - declares a multi-value attribute with 64-bit integers.
- tsvpipe_attr_string - declares a string attribute.
- tsvpipe_attr_json - declares a JSON attribute.
Example of a source using a TSV file:
source tsv_test
{
type = tsvpipe
tsvpipe_command = cat /tmp/rock_bands.tsv
tsvpipe_field = name
tsvpipe_attr_multi = genre_tags
}
1 Led Zeppelin 35,23,16
2 Deep Purple 35,92
3 Frank Zappa 35,23,16,92,33,24
The csvpipe_command directive is mandatory and contains the shell command invoked to produce the CSV stream that gets indexed. The command can read a CSV file, but it can also be a program that generates the comma-delimited content on the fly.
The following directives can be used to declare the types of the indexed columns:
- csvpipe_field - declares a text field.
- csvpipe_field_string - declares a text field/string attribute. The column will be both indexed as a text field and stored as a string attribute.
- csvpipe_attr_uint - declares an integer attribute.
- csvpipe_attr_timestamp - declares a timestamp attribute.
- csvpipe_attr_bool - declares a boolean attribute.
- csvpipe_attr_float - declares a float attribute.
- csvpipe_attr_bigint - declares a big integer attribute.
- csvpipe_attr_multi - declares a multi-value attribute with integers.
- csvpipe_attr_multi_64 - declares a multi-value attribute with 64-bit integers.
- csvpipe_attr_string - declares a string attribute.
- csvpipe_attr_json - declares a JSON attribute.
Example of a source using a CSV file:
source csv_test
{
type = csvpipe
csvpipe_command = cat /tmp/rock_bands.csv
csvpipe_field = name
csvpipe_attr_multi = genre_tags
}
1,"Led Zeppelin","35,23,16"
2,"Deep Purple","35,92"
3,"Frank Zappa","35,23,16,92,33,24"
In many situations, the total dataset is too large to be frequently rebuilt from scratch, while the number of new records remains relatively small. For example, a forum may have 1,000,000 archived posts but only receive 1,000 new posts per day.
In such cases, implementing "live" (nearly real-time) table updates can be achieved using a "main+delta" scheme.
The concept involves setting up two sources and two tables, with one "main" table for data that rarely changes (if ever), and one "delta" table for new documents. In the example, the 1,000,000 archived posts would be stored in the main table, while the 1,000 new daily posts would be placed in the delta table. The delta table can then be rebuilt frequently, making the documents available for searching within seconds or minutes. Determining which documents belong to which table and rebuilding the main table can be fully automated. One approach is to create a counter table that tracks the ID used to split the documents and update it whenever the main table is rebuilt.
Using a timestamp column as the split variable is more effective than using the ID since timestamps can track not only new documents but also modified ones.
For datasets that may contain modified or deleted documents, the delta table should provide a list of affected documents, ensuring they are suppressed and excluded from search queries. This is accomplished using a feature called Kill Lists. The document IDs to be killed can be specified in an auxiliary query defined by sql_query_killlist. The delta table must indicate the target tables for which the kill lists will be applied using the killlist_target directive. The impact of kill lists is permanent on the target table, meaning that even if a search is performed without the delta table, the suppressed documents will not appear in the search results.
Notice how we're overriding sql_query_pre in the delta source. We must explicitly include this override. If we don't, the REPLACE query would be executed during the delta source's build as well, effectively rendering it useless.
# in MySQL
CREATE TABLE deltabreaker (
index_name VARCHAR(50) NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (index_name)
);
# in manticore.conf
source main {
...
sql_query_pre = REPLACE INTO deltabreaker SET index_name = 'main', created_at = NOW()
sql_query = SELECT id, title, UNIX_TIMESTAMP(updated_at) AS updated FROM documents WHERE deleted=0 AND updated_at >=FROM_UNIXTIME($start) AND updated_at <=FROM_UNIXTIME($end)
sql_query_range = SELECT ( SELECT UNIX_TIMESTAMP(MIN(updated_at)) FROM documents) min, ( SELECT UNIX_TIMESTAMP(created_at)-1 FROM deltabreaker WHERE index_name='main') max
sql_query_post_index = REPLACE INTO deltabreaker set index_name = 'delta', created_at = (SELECT created_at FROM deltabreaker t WHERE index_name='main')
...
sql_attr_timestamp = updated
}
source delta : main {
sql_query_pre =
sql_query_range = SELECT ( SELECT UNIX_TIMESTAMP(created_at) FROM deltabreaker WHERE index_name='delta') min, UNIX_TIMESTAMP() max
sql_query_killlist = SELECT id FROM documents WHERE updated_at >= (SELECT created_at FROM deltabreaker WHERE index_name='delta')
}
table main {
path = /var/lib/manticore/main
source = main
}
table delta {
path = /var/lib/manticore/delta
source = delta
killlist_target = main:kl
}
Merging two existing plain tables can be more efficient than indexing the data from scratch and might be desired in some cases (such as merging 'main' and 'delta' tables instead of simply rebuilding 'main' in the 'main+delta' partitioning scheme). Thus, indexer provides an option to do that. Merging tables is typically faster than rebuilding, but still not instant for huge tables. Essentially, it needs to read the contents of both tables once and write the result once. Merging a 100 GB table and a 1 GB table, for example, will result in 202 GB of I/O (but that's still likely less than indexing from scratch requires).
The basic command syntax is as follows:
sudo -u manticore indexer --merge DSTINDEX SRCINDEX [--rotate] [--drop-src]
Unless --drop-src is specified, only the DSTINDEX table will be affected: the contents of SRCINDEX will be merged into it.
The --rotate switch is required if DSTINDEX is already being served by searchd.
The typical usage pattern is to merge a smaller update from SRCINDEX into DSTINDEX. Thus, when merging attributes, the values from SRCINDEX will take precedence if duplicate document IDs are encountered. However, note that the "old" keywords will not be automatically removed in such cases. For example, if there's a keyword "old" associated with document 123 in DSTINDEX, and a keyword "new" associated with it in SRCINDEX, document 123 will be found by both keywords after the merge. You can supply an explicit condition to remove documents from DSTINDEX to mitigate this; the relevant switch is --merge-dst-range:
sudo -u manticore indexer --merge main delta --merge-dst-range deleted 0 0
This switch allows you to apply filters to the destination table along with merging. There can be several filters; all of their conditions must be met in order to include the document in the resulting merged table. In the example above, the filter passes only those records where 'deleted' is 0, eliminating all records that were flagged as deleted.
--drop-src enables dropping SRCINDEX after the merge and before rotating the tables, which is important if you specify DSTINDEX in killlist_target of DSTINDEX. Otherwise, when rotating the tables, the documents that have been merged into DSTINDEX may be suppressed by SRCINDEX.
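For instance, a merge that drops the delta table afterwards and rotates the result into the running server might look like this (the table names are illustrative):
sudo -u manticore indexer --merge main delta --rotate --drop-src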
When using plain tables, there's a challenge arising from the need to have the data in the table as fresh as possible.
In this case, one or more secondary (also known as delta) tables are used to capture the modified data between the time the main table was created and the current time. The modified data can include new, updated, or deleted documents. The search becomes a search over the main table and the delta table. This works seamlessly when you just add new documents to the delta table, but when it comes to updated or deleted documents, there remains the following issue.
If a document is present in both the main and delta tables, it can cause issues during searching, as the engine will see two versions of a document and won't know how to pick the right one. So, the delta needs to somehow inform the search that there are deleted documents in the main table that should be disregarded. This is where kill lists come in.
A table can maintain a list of document IDs that can be used to suppress records in other tables. This feature is available for plain tables using database sources or plain tables using XML sources. In the case of database sources, the source needs to provide an additional query defined by sql_query_killlist. It will store in the table a list of documents that can be used by the server to remove documents from other plain tables.
This query is expected to return a number of 1-column rows, each containing just the document ID.
In many cases, the query is a union between a query that retrieves a list of updated documents and a list of deleted documents, e.g.:
sql_query_killlist = \
SELECT id FROM documents WHERE updated_ts>=@last_reindex UNION \
SELECT id FROM documents_deleted WHERE deleted_ts>=@last_reindex
A plain table can contain a directive called killlist_target that will tell the server it can provide a list of document IDs that should be removed from certain existing tables. The table can use either its document IDs as the source for this list or provide a separate list.
Sets the table(s) that the kill-list will be applied to. Optional, default value is empty.
When you use plain tables, you often need to maintain not just a single table, but a set of them to be able to add/update/delete new documents sooner (read about delta table updates). In order to suppress matches in the previous (main) table that were updated or deleted in the next (delta) table, you need to:
Specify killlist_target in the delta table settings:
table products {
killlist_target = main:kl
path = products
source = src_base
}
When killlist_target is specified, the kill-list is applied to all the tables listed in it on searchd startup. If any of the tables from killlist_target are rotated, the kill-list is reapplied to these tables. When the kill-list is applied, tables that were affected save these changes to disk.
killlist_target has 3 modes of operation:
killlist_target = main:kl. Document IDs from the kill-list of the delta table are suppressed in the main table (see sql_query_killlist).killlist_target = main:id. All document IDs from the delta table are suppressed in the main table. The kill-list is ignored.killlist_target = main. Both document IDs from the delta table and its kill-list are suppressed in the main table.Multiple targets can be specified, separated by commas like:
killlist_target = table_one:kl,table_two:kl
You can change the killlist_target settings for a table without rebuilding it by using ALTER.
However, since the 'old' main table has already written the changes to disk, the documents that were deleted in it will remain deleted even if it is no longer in the killlist_target of the delta table.
ALTER TABLE delta KILLLIST_TARGET='new_main_table:kl'
POST /cli -d "
ALTER TABLE delta KILLLIST_TARGET='new_main_table:kl'"
A plain table can be converted into a real-time table or added to an existing real-time table.
The first case is useful when you need to regenerate a real-time table completely, which may be needed, for example, if tokenization settings need an update. In this situation, preparing a plain table and converting it into a real-time table may be easier than preparing a batch job to perform INSERTs for adding all the data into a real-time table.
In the second case, you normally want to add a large bulk of new data to a real-time table, and again, creating a plain table with that data is easier than populating the existing real-time table.
The ATTACH statement allows you to convert a plain table to be attached to an existing real-time table. It also enables you to attach the content of one real-time table to another real-time table.
ATTACH TABLE plain_or_rt_table TO TABLE rt_table [WITH TRUNCATE]
The ATTACH TABLE statement lets you move data from a plain table or an RT table to an RT table.
After a successful ATTACH the data originally stored in the source plain table becomes a part of the target RT table, and the source plain table becomes unavailable (until the next rebuild). If the source table is an RT table, its content is moved into the destination RT table, and the source RT table remains empty. ATTACH does not result in any table data changes. Essentially, it just renames the files (making the source table a new disk chunk of the target RT table) and updates the metadata. So it is generally a quick operation that might (frequently) complete as fast as under a second.
Note that when a table is attached to an empty RT table, the fields, attributes, and text processing settings (tokenizer, wordforms, etc.) from the source table are copied over and take effect. The respective parts of the RT table definition from the configuration file will be ignored.
When the TRUNCATE option is used, the RT table gets truncated prior to attaching the source plain table. This allows the operation to be atomic or ensures that the attached source plain table will be the only data in the target RT table.
ATTACH TABLE comes with a number of restrictions. Most notably, the target RT table is currently required to be either empty or have the same settings as the source table. In case the source table gets attached to a non-empty RT table, the RT table data collected so far gets stored as a regular disk chunk, and the table being attached becomes the newest disk chunk, with documents having the same IDs getting killed. The complete list of restrictions is as follows:
mysql> DESC rt;
+-----------+---------+
| Field | Type |
+-----------+---------+
| id | integer |
| testfield | field |
| testattr | uint |
+-----------+---------+
3 rows in set (0.00 sec)
mysql> SELECT * FROM rt;
Empty set (0.00 sec)
mysql> SELECT * FROM plain WHERE MATCH('test');
+------+--------+----------+------------+
| id | weight | group_id | date_added |
+------+--------+----------+------------+
| 1 | 1304 | 1 | 1313643256 |
| 2 | 1304 | 1 | 1313643256 |
| 3 | 1304 | 1 | 1313643256 |
| 4 | 1304 | 1 | 1313643256 |
+------+--------+----------+------------+
4 rows in set (0.00 sec)
mysql> ATTACH TABLE plain TO TABLE rt;
Query OK, 0 rows affected (0.00 sec)
mysql> DESC rt;
+------------+-----------+
| Field | Type |
+------------+-----------+
| id | integer |
| title | field |
| content | field |
| group_id | uint |
| date_added | timestamp |
+------------+-----------+
5 rows in set (0.00 sec)
mysql> SELECT * FROM rt WHERE MATCH('test');
+------+--------+----------+------------+
| id | weight | group_id | date_added |
+------+--------+----------+------------+
| 1 | 1304 | 1 | 1313643256 |
| 2 | 1304 | 1 | 1313643256 |
| 3 | 1304 | 1 | 1313643256 |
| 4 | 1304 | 1 | 1313643256 |
+------+--------+----------+------------+
4 rows in set (0.00 sec)
mysql> SELECT * FROM plain WHERE MATCH('test');
ERROR 1064 (42000): no enabled local indexes to search
If you decide to migrate from Plain mode to RT mode or in some other cases, real-time and percolate tables built in Plain mode can be imported to Manticore running in RT mode using the IMPORT TABLE statement. The general syntax is as follows:
IMPORT TABLE table_name FROM 'path'
where the 'path' parameter must be set as: /your_backup_folder/your_backup_name/data/your_table_name/your_table_name
mysql -P9306 -h0 -e 'create table t(f text)'
mysql -P9306 -h0 -e "backup table t to /tmp/"
mysql -P9306 -h0 -e "drop table t"
BACKUP_NAME=$(ls /tmp | grep 'backup-' | tail -n 1)
mysql -P9306 -h0 -e "import table t from '/tmp/$BACKUP_NAME/data/t/t'"
mysql -P9306 -h0 -e "show tables"
Executing this command copies all the files of the specified table to data_dir. All external table files, such as wordforms, exceptions, and stopwords, are also copied to the same data_dir.
IMPORT TABLE has the following limitations:
If the above method for migrating a plain table to an RT table is not possible, you may use indexer --print-rt to dump data from a plain table directly without the need to convert it to an RT type table and then import the dump into an RT table right from the command line.
This method has a few limitations though:
/usr/bin/indexer --rotate --config /etc/manticoresearch/manticore.conf --print-rt my_rt_index my_plain_index > /tmp/dump_regular.sql
mysql -P 9306 -h0 -e "truncate table my_rt_index"
mysql -P 9306 -h0 < /tmp/dump_regular.sql
rm /tmp/dump_regular.sql
Table rotation is a procedure in which the searchd server looks for new versions of defined tables in the configuration. Rotation is supported only in Plain mode of operation.
There can be two cases:
In the first case, the indexer cannot put the new version of the table online, as the running copy is locked and loaded by searchd. In this case, indexer needs to be called with the --rotate parameter. If --rotate is used, indexer creates new table files with .new. in their names and sends a HUP signal to searchd, informing it about the new version. searchd will then perform a lookup, put the new version of the table in place, and discard the old one. In some cases, it might be desirable to create the new version of the table but not perform the rotation right away, for example, to first check the health of the new table files. In this case, indexer can accept the --nohup parameter, which will prevent it from sending the HUP signal to the server.
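For example, rebuilding a table that searchd is already serving might look like this (the config path and the table name 'main' are illustrative):
sudo -u manticore indexer --config /etc/manticoresearch/manticore.conf --rotate main
# build the new version but don't signal searchd yet, e.g. to inspect the new files first
sudo -u manticore indexer --config /etc/manticoresearch/manticore.conf --nohup main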
New tables can be loaded by rotation; however, the regular handling of the HUP signal is to check for new tables only if the configuration has changed since server startup. If the table was already defined in the configuration, the table should first be built by running indexer without rotation, and then the RELOAD TABLES statement should be performed instead.
There are also two specialized statements that can be used to perform rotations on tables:
RELOAD TABLE tbl [ FROM '/path/to/table_files' [ OPTION switchover=1 ] ];
The RELOAD TABLE command enables table rotation via SQL.
This command functions in three modes. In the first mode, without specifying a path, the Manticore server checks for new table files in the directory specified by the table's path setting. New table files must be named tbl.new.sp?.
If you specify a path, the server searches for table files in that directory, relocates them to the table path, renames them from tbl.sp? to tbl.new.sp?, and rotates them.
The third mode, activated by OPTION switchover=1, switches the index to the new path. Here, the daemon tries to load the table directly from the new path without moving the files. If loading is successful, this new index supersedes the old one.
Also, the daemon writes a unique link file (tbl.link) in the directory specified by path, maintaining persistent redirection.
If you revert a redirected index to the path specified in the configuration, the daemon will detect this and delete the link file.
Once redirected, the daemon retrieves the table from the newly linked path. When rotating, it looks for new table versions at the newly redirected path. Bear in mind, the daemon checks the configuration for common errors, like duplicate paths across different tables. However, it won't identify if multiple tables point to the same path via redirection. Under normal operations, tables are locked with the .spl file, but disabling the lock may cause problems. If there's an error (e.g., the path is inaccessible for any reason), you should manually correct (or simply delete) the link file.
indextool follows the link file, but other tools (indexer, index_converter, etc.) do not recognize the link file and consistently use the path defined in the configuration file, ignoring any redirection. Thus, you can inspect the index with indextool, and it will read from the new location. However, more complex operations like merging will not acknowledge any link file.
mysql> RELOAD TABLE plain_table;
mysql> RELOAD TABLE plain_table FROM '/home/mighty/new_table_files';
mysql> RELOAD TABLE plain_table FROM '/home/mighty/new/place/for/table/table_files' OPTION switchover=1;
RELOAD TABLES;
This command functions similarly to the HUP system signal, triggering a rotation of tables. Nevertheless, it doesn't exactly mirror the typical HUP signal (which can come from a kill -HUP command or indexer --rotate). This command actively searches for any tables needing rotation and is capable of re-reading the configuration. Suppose you launch Manticore in plain mode with a config file that points to a nonexistent plain table. If you then attempt to indexer --rotate the table, the new table won't be recognized by the server until you execute RELOAD TABLES or restart the server.
Depending on the value of the seamless_rotate setting, new queries might be shortly stalled, and clients will receive temporary errors.
mysql> RELOAD TABLES;
Query OK, 0 rows affected (0.01 sec)
Rotation assumes the old table version is discarded and the new table version is loaded to replace the existing one. During this swap, the server also needs to serve incoming queries made on the table being updated. To avoid query stalls, the server performs a seamless rotation of the table by default, as described below.
Tables may contain data that needs to be precached in RAM. At the moment, .spa, .spb, .spi and .spm files are fully precached (they contain attribute data, blob attribute data, keyword table, and killed row map, respectively). Without seamless rotate, rotating a table tries to use as little RAM as possible and works as follows:
- searchd waits for all currently running queries to finish;
- searchd resumes serving queries from the new table.
However, if there's a lot of attribute or dictionary data, the preloading step could take noticeable time - up to several minutes in the case of preloading 1-5+ GB files.
With seamless rotate enabled, rotation works as follows:
Seamless rotate comes at the cost of higher peak memory usage during the rotation (because both old and new copies of .spa/.spb/.spi/.spm data need to be in RAM while preloading the new copy). However, average usage stays the same.
Example:
seamless_rotate = 1
You can modify existing data in an RT or PQ table by either updating or replacing it.
UPDATE replaces row-wise attribute values of existing documents with new values. Full-text fields and columnar attributes cannot be updated. If you need to change the content of a full-text field or columnar attributes, use REPLACE.
REPLACE works similarly to INSERT except that if an old document has the same ID as the new document, the old document is marked as deleted before the new document is inserted. Note that the old document does not get physically deleted from the table. The deletion can only happen when chunks are merged in a table, e.g., as a result of an OPTIMIZE.
REPLACE works similarly to INSERT, but it marks the previous document with the same ID as deleted before inserting a new one.
If you are looking for in-place updates, please see this section.
The syntax of the SQL REPLACE statement is as follows:
To replace the whole document:
REPLACE INTO table [(column1, column2, ...)]
VALUES (value1, value2, ...)
[, (...)]
To replace only selected fields:
REPLACE INTO table
SET field1=value1[, ..., fieldN=valueN]
WHERE id = <id>
Note that in this mode you can only filter by id.
See the examples for more details.
/replace:
POST /replace
{
"index": "<table name>",
"id": <document id>,
"doc":
{
"<field1>": <value1>,
...
"<fieldN>": <valueN>
}
}
/index is an alias endpoint and works the same.
Elasticsearch-like endpoint <table>/_doc/<id>:
PUT/POST /<table name>/_doc/<id>
{
"<field1>": <value1>,
...
"<fieldN>": <valueN>
}
Partial replace:
POST /<table name>/_update/<id>
{
"<field1>": <value1>,
...
"<fieldN>": <valueN>
}
See the examples for more details.
REPLACE INTO products VALUES(1, "document one", 10);
Query OK, 1 row affected (0.00 sec)
REPLACE INTO products SET description='HUAWEI Matebook 15', price=10 WHERE id = 55;
Query OK, 1 row affected (0.00 sec)
POST /replace
-H "Content-Type: application/x-ndjson" -d '
{
"index":"products",
"id":1,
"doc":
{
"title":"product one",
"price":10
}
}
'
{
"_index":"products",
"_id":1,
"created":false,
"result":"updated",
"status":200
}
PUT /products/_doc/2
{
"title": "product two",
"price": 20
}
POST /products/_doc/3
{
"title": "product three",
"price": 10
}
{
"_id":2,
"_index":"products",
"_primary_term":1,
"_seq_no":0,
"_shards":{
"failed":0,
"successful":1,
"total":1
},
"_type":"_doc",
"_version":1,
"result":"updated"
}
{
"_id":3,
"_index":"products",
"_primary_term":1,
"_seq_no":0,
"_shards":{
"failed":0,
"successful":1,
"total":1
},
"_type":"_doc",
"_version":1,
"result":"updated"
}
POST /products/_update/55
{
"doc": {
"description": "HUAWEI Matebook 15",
"price": 10
}
}
{
"_index":"products",
"updated":1
}
$index->replaceDocument([
'title' => 'document one',
'price' => 10
],1);
Array(
[_index] => products
[_id] => 1
[created] => false
[result] => updated
[status] => 200
)
indexApi.replace({"index" : "products", "id" : 1, "doc" : {"title" : "document one","price":10}})
{'created': False,
'found': None,
'id': 1,
'index': 'products',
'result': 'updated'}
res = await indexApi.replace({"index" : "products", "id" : 1, "doc" : {"title" : "document one","price":10}});
{"_index":"products","_id":1,"result":"updated"}
docRequest = new InsertDocumentRequest();
HashMap<String,Object> doc = new HashMap<String,Object>(){{
put("title","document one");
put("price",10);
}};
docRequest.index("products").id(1L).setDoc(doc);
sqlresult = indexApi.replace(docRequest);
class SuccessResponse {
index: products
id: 1
created: false
result: updated
found: null
}
Dictionary<string, Object> doc = new Dictionary<string, Object>();
doc.Add("title", "document one");
doc.Add("price", 10);
InsertDocumentRequest docRequest = new InsertDocumentRequest(index: "products", id: 1, doc: doc);
var sqlresult = indexApi.replace(docRequest);
class SuccessResponse {
index: products
id: 1
created: false
result: updated
found: null
}
res = await indexApi.replace({
index: 'test',
id: 1,
doc: { content: 'Text 11', name: 'Doc 11', cat: 3 },
});
{
"_index":"test",
"_id":1,
"created":false
"result":"updated"
"status":200
}
replaceDoc := map[string]interface{} {"content": "Text 11", "name": "Doc 11", "cat": 3}
replaceRequest := manticoreclient.NewInsertDocumentRequest("test", replaceDoc)
replaceRequest.SetId(1)
res, _, _ := apiClient.IndexAPI.Replace(context.Background()).InsertDocumentRequest(*replaceRequest).Execute()
{
"_index":"test",
"_id":1,
"created":false
"result":"updated"
"status":200
}
REPLACE is available for real-time and percolate tables. You can't replace data in a plain table.
When you run a REPLACE, the previous document is not removed, but it's marked as deleted, so the table size grows until chunk merging happens. To force a chunk merge, use the OPTIMIZE statement.
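For example, assuming a table named products, the merge can be triggered like this:
OPTIMIZE TABLE products;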
You can replace multiple documents at once. Check bulk adding documents for more information.
REPLACE INTO products(id,title,tag) VALUES (1, 'doc one', 10), (2, 'doc two', 20);
Query OK, 2 rows affected (0.00 sec)
POST /bulk
-H "Content-Type: application/x-ndjson" -d '
{ "replace" : { "index" : "products", "id":1, "doc": { "title": "doc one", "tag" : 10 } } }
{ "replace" : { "index" : "products", "id":2, "doc": { "title": "doc two", "tag" : 20 } } }
'
{
"items":
[
{
"replace":
{
"_index":"products",
"_id":1,
"created":false,
"result":"updated",
"status":200
}
},
{
"replace":
{
"_index":"products",
"_id":2,
"created":false,
"result":"updated",
"status":200
}
}
],
"errors":false
}
$index->replaceDocuments([
[
'id' => 1,
'title' => 'document one',
'tag' => 10
],
[
'id' => 2,
'title' => 'document two',
'tag' => 20
]
]);
Array(
[items] =>
Array(
Array(
[_index] => products
[_id] => 1
[created] => false
[result] => updated
[status] => 200
)
Array(
[_index] => products
[_id] => 2
[created] => false
[result] => updated
[status] => 200
)
)
[errors] => false
)
indexApi = manticoresearch.IndexApi(client)
docs = [ \
{"replace": {"index" : "products", "id" : 1, "doc" : {"title" : "document one"}}}, \
{"replace": {"index" : "products", "id" : 2, "doc" : {"title" : "document two"}}} ]
api_resp = indexApi.bulk('\n'.join(map(json.dumps,docs)))
{'error': None,
'items': [{u'replace': {u'_id': 1,
u'_index': u'products',
u'created': False,
u'result': u'updated',
u'status': 200}},
{u'replace': {u'_id': 2,
u'_index': u'products',
u'created': False,
u'result': u'updated',
u'status': 200}}]}
docs = [
{"replace": {"index" : "products", "id" : 1, "doc" : {"title" : "document one"}}},
{"replace": {"index" : "products", "id" : 2, "doc" : {"title" : "document two"}}} ];
res = await indexApi.bulk(docs.map(e=>JSON.stringify(e)).join('\n'));
{"items":[{"replace":{"_index":"products","_id":1,"created":false,"result":"updated","status":200}},{"replace":{"_index":"products","_id":2,"created":false,"result":"updated","status":200}}],"errors":false}
body = "{\"replace\": {\"index\" : \"products\", \"id\" : 1, \"doc\" : {\"title\" : \"document one\"}}}" +"\n"+
"{\"replace\": {\"index\" : \"products\", \"id\" : 2, \"doc\" : {\"title\" : \"document two\"}}}"+"\n" ;
indexApi.bulk(body);
class BulkResponse {
items: [{replace={_index=products, _id=1, created=false, result=updated, status=200}}, {replace={_index=products, _id=2, created=false, result=updated, status=200}}]
error: null
additionalProperties: {errors=false}
}
string body = "{\"replace\": {\"index\" : \"products\", \"id\" : 1, \"doc\" : {\"title\" : \"document one\"}}}" +"\n"+
"{\"replace\": {\"index\" : \"products\", \"id\" : 2, \"doc\" : {\"title\" : \"document two\"}}}"+"\n" ;
indexApi.Bulk(body);
class BulkResponse {
items: [{replace={_index=products, _id=1, created=false, result=updated, status=200}}, {replace={_index=products, _id=2, created=false, result=updated, status=200}}]
error: null
additionalProperties: {errors=false}
}
replaceDocs = [
{
replace: {
index: 'test',
id: 1,
doc: { content: 'Text 11', cat: 1, name: 'Doc 11' },
},
},
{
replace: {
index: 'test',
id: 2,
doc: { content: 'Text 22', cat: 9, name: 'Doc 22' },
},
},
];
res = await indexApi.bulk(
replaceDocs.map((e) => JSON.stringify(e)).join("\n")
);
{
"items":
[
{
"replace":
{
"_index":"test",
"_id":1,
"created":false,
"result":"updated",
"status":200
}
},
{
"replace":
{
"_index":"test",
"_id":2,
"created":false,
"result":"updated",
"status":200
}
}
],
"errors":false
}
body := "{\"replace\": {\"index\": \"test\", \"id\": 1, \"doc\": {\"content\": \"Text 11\", \"name\": \"Doc 11\", \"cat\": 1 }}}" + "\n" +
"{\"replace\": {\"index\": \"test\", \"id\": 2, \"doc\": {\"content\": \"Text 22\", \"name\": \"Doc 22\", \"cat\": 9 }}}" +"\n";
res, _, _ := apiClient.IndexAPI.Bulk(context.Background()).Body(body).Execute()
{
"items":
[
{
"replace":
{
"_index":"test",
"_id":1,
"created":false,
"result":"updated",
"status":200
}
},
{
"replace":
{
"_index":"test",
"_id":2,
"created":false,
"result":"updated",
"status":200
}
}
],
"errors":false
}
UPDATE changes row-wise attribute values of existing documents in a specified table with new values. Note that you can't update the contents of a fulltext field or a columnar attribute. If there's such a need, use REPLACE.
Attribute updates are supported for RT, PQ, and plain tables. All attribute types can be updated as long as they are stored in the traditional row-wise storage.
Note that the document ID cannot be updated.
Note that when you update an attribute, its secondary index gets disabled, so consider replacing the document instead.
UPDATE products SET enabled=0 WHERE id=10;
Query OK, 1 row affected (0.00 sec)
POST /update
{
"index":"products",
"id":10,
"doc":
{
"enabled":0
}
}
{
"_index":"products",
"updated":1
}
$index->updateDocument([
'enabled'=>0
],10);
Array(
[_index] => products
[_id] => 10
[result] => updated
)
indexApi = api = manticoresearch.IndexApi(client)
indexApi.update({"index" : "products", "id" : 1, "doc" : {"price":10}})
{'id': 1, 'index': 'products', 'result': 'updated', 'updated': None}
res = await indexApi.update({"index" : "products", "id" : 1, "doc" : {"price":10}});
{"_index":"products","_id":1,"result":"updated"}
UpdateDocumentRequest updateRequest = new UpdateDocumentRequest();
doc = new HashMap<String,Object >(){{
put("price",10);
}};
updateRequest.index("products").id(1L).setDoc(doc);
indexApi.update(updateRequest);
class UpdateResponse {
index: products
updated: null
id: 1
result: updated
}
Dictionary<string, Object> doc = new Dictionary<string, Object>();
doc.Add("price", 10);
UpdateDocumentRequest updateRequest = new UpdateDocumentRequest(index: "products", id: 1, doc: doc);
indexApi.Update(updateRequest);
class UpdateResponse {
index: products
updated: null
id: 1
result: updated
}
res = await indexApi.update({ index: "test", id: 1, doc: { cat: 10 } });
{
"_index":"test",
"_id":1,
"result":"updated"
}
updateDoc = map[string]interface{} {"cat":10}
updateRequest = openapiclient.NewUpdateDocumentRequest("test", updateDoc)
updateRequest.SetId(1)
res, _, _ = apiClient.IndexAPI.Update(context.Background()).UpdateDocumentRequest(*updateRequest).Execute()
{
"_index":"test",
"_id":1,
"result":"updated"
}
Multiple attributes can be updated in a single statement. Example:
UPDATE products
SET price=100000000000,
coeff=3465.23,
tags1=(3,6,4),
tags2=()
WHERE MATCH('phone') AND enabled=1;
Query OK, 148 rows affected (0.0 sec)
POST /update
{
"index":"products",
"doc":
{
"price":100000000000,
"coeff":3465.23,
"tags1":[3,6,4],
"tags2":[]
},
"query":
{
"match": { "*": "phone" },
"equals": { "enabled": 1 }
}
}
{
"_index":"products",
"updated":148
}
$query= new BoolQuery();
$query->must(new Match('phone','*'));
$query->must(new Equals('enabled',1));
$index->updateDocuments([
'price' => 100000000000,
'coeff' => 3465.23,
'tags1' => [3,6,4],
'tags2' => []
],
$query
);
Array(
[_index] => products
[updated] => 148
)
indexApi = api = manticoresearch.IndexApi(client)
indexApi.update({"index" : "products", "id" : 1, "doc" : {
"price": 100000000000,
"coeff": 3465.23,
"tags1": [3,6,4],
"tags2": []}})
{'id': 1, 'index': 'products', 'result': 'updated', 'updated': None}
res = await indexApi.update({"index" : "products", "id" : 1, "doc" : {
"price": 100000000000,
"coeff": 3465.23,
"tags1": [3,6,4],
"tags2": []}});
{"_index":"products","_id":1,"result":"updated"}
UpdateDocumentRequest updateRequest = new UpdateDocumentRequest();
doc = new HashMap<String,Object >(){{
put("price",10);
put("coeff",3465.23);
put("tags1",new int[]{3,6,4});
put("tags2",new int[]{});
}};
updateRequest.index("products").id(1L).setDoc(doc);
indexApi.update(updateRequest);
class UpdateResponse {
index: products
updated: null
id: 1
result: updated
}
Dictionary<string, Object> doc = new Dictionary<string, Object>();
doc.Add("price", 10);
doc.Add("coeff", 3465.23);
doc.Add("tags1", new List<int> {3,6,4});
doc.Add("tags2", new List<int> {});
UpdateDocumentRequest updateRequest = new UpdateDocumentRequest(index: "products", id: 1, doc: doc);
indexApi.Update(updateRequest);
class UpdateResponse {
index: products
updated: null
id: 1
result: updated
}
res = await indexApi.update({ index: "test", id: 1, doc: { name: "Doc 21", cat: "10" } });
{
"_index":"test",
"_id":1,
"result":"updated"
}
updateDoc = map[string]interface{} {"name":"Doc 21", "cat":10}
updateRequest = manticoreclient.NewUpdateDocumentRequest("test", updateDoc)
updateRequest.SetId(1)
res, _, _ = apiClient.IndexAPI.Update(context.Background()).UpdateDocumentRequest(*updateRequest).Execute()
{
"_index":"test",
"_id":1,
"result":"updated"
}
When assigning out-of-range values to 32-bit attributes, they will be trimmed to their lower 32 bits without a prompt. For example, if you try to update the 32-bit unsigned int with a value of 4294967297, the value of 1 will actually be stored, because the lower 32 bits of 4294967297 (0x100000001 in hex) amount to 1 (0x00000001 in hex).
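As a quick illustration, assuming a 32-bit unsigned integer attribute named views (a hypothetical column), the following statement would end up storing 1 rather than 4294967297, since only the lower 32 bits are kept:
UPDATE products SET views=4294967297 WHERE id=1;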
UPDATE can be used to perform partial JSON updates on numeric data types or arrays of numeric data types. Just make sure you don't update an integer value with a float value as it will be rounded off.
insert into products (id, title, meta) values (1,'title','{"tags":[1,2,3]}');
update products set meta.tags[0]=100 where id=1;
Query OK, 1 row affected (0.00 sec)
Query OK, 1 row affected (0.00 sec)
POST /insert
{
"index":"products",
"id":100,
"doc":
{
"title":"title",
"meta": {
"tags":[1,2,3]
}
}
}
POST /update
{
"index":"products",
"id":100,
"doc":
{
"meta.tags[0]":100
}
}
{
"_index":"products",
"_id":100,
"created":true,
"result":"created",
"status":201
}
{
"_index":"products",
"updated":1
}
$index->insertDocument([
'title' => 'title',
'meta' => ['tags' => [1,2,3]]
],1);
$index->updateDocument([
'meta.tags[0]' => 100
],1);
Array(
[_index] => products
[_id] => 1
[created] => true
[result] => created
)
Array(
[_index] => products
[updated] => 1
)
indexApi = api = manticoresearch.IndexApi(client)
indexApi.update({"index" : "products", "id" : 1, "doc" : {
"meta.tags[0]": 100}})
{'id': 1, 'index': 'products', 'result': 'updated', 'updated': None}
res = await indexApi.update({"index" : "products", "id" : 1, "doc" : {
"meta.tags[0]": 100}});
{"_index":"products","_id":1,"result":"updated"}
UpdateDocumentRequest updateRequest = new UpdateDocumentRequest();
doc = new HashMap<String,Object >(){{
put("meta.tags[0]",100);
}};
updateRequest.index("products").id(1L).setDoc(doc);
indexApi.update(updateRequest);
class UpdateResponse {
index: products
updated: null
id: 1
result: updated
}
Dictionary<string, Object> doc = new Dictionary<string, Object>();
doc.Add("meta.tags[0]", 100);
UpdateDocumentRequest updateRequest = new UpdateDocumentRequest(index: "products", id: 1, doc: doc);
indexApi.Update(updateRequest);
class UpdateResponse {
index: products
updated: null
id: 1
result: updated
}
res = await indexApi.update({"index" : "test", "id" : 1, "doc" : { "meta.tags[0]": 100} });
{"_index":"test","_id":1,"result":"updated"}
updateDoc = map[string]interface{} {"meta.tags[0]":100}
updateRequest = manticoreclient.NewUpdateDocumentRequest("test", updateDoc)
updateRequest.SetId(1)
res, _, _ = apiClient.IndexAPI.Update(context.Background()).UpdateDocumentRequest(*updateRequest).Execute()
{
"_index":"test",
"_id":1,
"result":"updated"
}
Updating other data types or changing property type in a JSON attribute requires a full JSON update.
insert into products values (1,'title','{"tags":[1,2,3]}');
update products set data='{"tags":["one","two","three"]}' where id=1;
Query OK, 1 row affected (0.00 sec)
Query OK, 1 row affected (0.00 sec)
POST /insert
{
"index":"products",
"id":1,
"doc":
{
"title":"title",
"data":"{\"tags\":[1,2,3]}"
}
}
POST /update
{
"index":"products",
"id":1,
"doc":
{
"data":"{\"tags\":[\"one\",\"two\",\"three\"]}"
}
}
{
"_index":"products",
"updated":1
}
$index->insertDocument([
'title'=> 'title',
'data' => [
'tags' => [1,2,3]
]
],1);
$index->updateDocument([
'data' => [
'one', 'two', 'three'
]
],1);
Array(
[_index] => products
[_id] => 1
[created] => true
[result] => created
)
Array(
[_index] => products
[updated] => 1
)
indexApi.insert({"index" : "products", "id" : 100, "doc" : {"title" : "title", "meta" : {"tags":[1,2,3]}}})
indexApi.update({"index" : "products", "id" : 100, "doc" : {"meta" : {"tags":['one','two','three']}}})
{'created': True,
'found': None,
'id': 100,
'index': 'products',
'result': 'created'}
{'id': 100, 'index': 'products', 'result': 'updated', 'updated': None}
res = await indexApi.insert({"index" : "products", "id" : 100, "doc" : {"title" : "title", "meta" : {"tags":[1,2,3]}}});
res = await indexApi.update({"index" : "products", "id" : 100, "doc" : {"meta" : {"tags":['one','two','three']}}});
{"_index":"products","_id":100,"created":true,"result":"created"}
{"_index":"products","_id":100,"result":"updated"}
InsertDocumentRequest newdoc = new InsertDocumentRequest();
doc = new HashMap<String,Object>(){{
put("title","title");
put("meta",
new HashMap<String,Object>(){{
put("tags",new int[]{1,2,3});
}});
}};
newdoc.index("products").id(100L).setDoc(doc);
indexApi.insert(newdoc);
updatedoc = new UpdateDocumentRequest();
doc = new HashMap<String,Object >(){{
put("meta",
new HashMap<String,Object>(){{
put("tags",new String[]{"one","two","three"});
}});
}};
updatedoc.index("products").id(100L).setDoc(doc);
indexApi.update(updatedoc);
class SuccessResponse {
index: products
id: 100
created: true
result: created
found: null
}
class UpdateResponse {
index: products
updated: null
id: 100
result: updated
}
Dictionary<string, Object> meta = new Dictionary<string, Object>();
meta.Add("tags", new List<int> {1,2,3});
Dictionary<string, Object> doc = new Dictionary<string, Object>();
doc.Add("title", "title");
doc.Add("meta", meta);
InsertDocumentRequest newdoc = new InsertDocumentRequest(index: "products", id: 100, doc: doc);
indexApi.Insert(newdoc);
meta = new Dictionary<string, Object>();
meta.Add("tags", new List<string> {"one","two","three"});
doc = new Dictionary<string, Object>();
doc.Add("meta", meta);
UpdateDocumentRequest updatedoc = new UpdateDocumentRequest(index: "products", id: 100, doc: doc);
indexApi.Update(updatedoc);
class SuccessResponse {
index: products
id: 100
created: true
result: created
found: null
}
class UpdateResponse {
index: products
updated: null
id: 100
result: updated
}
res = await indexApi.insert({
index: 'test',
id: 1,
doc: { content: 'Text 1', name: 'Doc 1', meta: { tags:[1,2,3] } }
})
res = await indexApi.update({ index: 'test', id: 1, doc: { meta: { tags:['one','two','three'] } } });
{
"_index":"test",
"_id":1,
"created":true,
"result":"created"
}
{
"_index":"test",
"_id":1,
"result":"updated"
}
metaField := map[string]interface{} {"tags": []int{1, 2, 3}}
insertDoc := map[string]interface{} {"name": "Doc 1", "meta": metaField}
insertRequest := manticoreclient.NewInsertDocumentRequest("test", insertDoc)
insertRequest.SetId(1)
res, _, _ := apiClient.IndexAPI.Insert(context.Background()).InsertDocumentRequest(*insertRequest).Execute();
metaField = map[string]interface{} {"tags": []string{"one", "two", "three"}}
updateDoc := map[string]interface{} {"meta": metaField}
updateRequest := manticoreclient.NewUpdateDocumentRequest("test", updateDoc)
res, _, _ = apiClient.IndexAPI.Update(context.Background()).UpdateDocumentRequest(*updateRequest).Execute()
{
"_index":"test",
"_id":1,
"created":true,
"result":"created"
}
{
"_index":"test",
"_id":1,
"result":"updated"
}
When using replication, the table name should be prepended with cluster_name: (in SQL) so that updates will be propagated to all nodes in the cluster. For queries via HTTP, you should set a cluster property. See setting up replication for more information.
{
"cluster":"nodes4",
"index":"test",
"id":1,
"doc":
{
"gid" : 100,
"price" : 1000
}
}
update weekly:posts set enabled=0 where id=1;
POST /update
{
"cluster":"weekly",
"index":"products",
"id":1,
"doc":
{
"enabled":0
}
}
$index->setName('products')->setCluster('weekly');
$index->updateDocument(['enabled'=>0],1);
indexApi.update({"cluster":"weekly", "index" : "products", "id" : 1, "doc" : {"enabled" : 0}})
res = await indexApi.update({"cluster":"weekly", "index" : "products", "id" : 1, "doc" : {"enabled" : 0}});
updatedoc = new UpdateDocumentRequest();
doc = new HashMap<String,Object >(){{
put("enabled",0);
}};
updatedoc.index("products").cluster("weekly").id(1L).setDoc(doc);
indexApi.update(updatedoc);
Dictionary<string, Object> doc = new Dictionary<string, Object>();
doc.Add("enabled", 0);
UpdateDocumentRequest updatedoc = new UpdateDocumentRequest(index: "products", cluster: "weekly", id: 1, doc: doc);
indexApi.Update(updatedoc);
res = await indexApi.update( {cluster: 'test_cluster', index : 'test', id : 1, doc : {name : 'Doc 11'}} );
updateDoc = map[string]interface{} {"name":"Doc 11"}
updateRequest = manticoreclient.NewUpdateDocumentRequest("test", updateDoc)
updateRequest.SetCluster("test_cluster")
updateRequest.SetId(1)
res, _, _ = apiClient.IndexAPI.Update(context.Background()).UpdateDocumentRequest(*updateRequest).Execute()
Here is the syntax for the SQL UPDATE statement:
UPDATE table SET col1 = newval1 [, ...] WHERE where_condition [OPTION opt_name = opt_value [, ...]] [FORCE|IGNORE INDEX(id)]
where_condition has the same syntax as in the SELECT statement.
Multi-value attribute value sets must be specified as comma-separated lists in parentheses. To remove all values from a multi-value attribute, just assign () to it.
UPDATE products SET tags1=(3,6,4) WHERE id=1;
UPDATE products SET tags1=() WHERE id=1;
Query OK, 1 row affected (0.00 sec)
Query OK, 1 row affected (0.00 sec)
POST /update
{
"index":"products",
"_id":1,
"doc":
{
"tags1": []
}
}
{
"_index":"products",
"updated":1
}
$index->updateDocument(['tags1'=>[]],1);
Array(
[_index] => products
[updated] => 1
)
indexApi.update({"index" : "products", "id" : 1, "doc" : {"tags1": []}})
{'id': 1, 'index': 'products', 'result': 'updated', 'updated': None}
indexApi.update({"index" : "products", "id" : 1, "doc" : {"tags1": []}})
{"_index":"products","_id":1,"result":"updated"}
updatedoc = new UpdateDocumentRequest();
doc = new HashMap<String,Object >(){{
put("tags1",new int[]{});
}};
updatedoc.index("products").id(1L).setDoc(doc);
indexApi.update(updatedoc);
class UpdateResponse {
index: products
updated: null
id: 1
result: updated
}
Dictionary<string, Object> doc = new Dictionary<string, Object>();
doc.Add("tags1", new List<int> {});
UpdateDocumentRequest updatedoc = new UpdateDocumentRequest(index: "products", id: 1, doc: doc);
indexApi.Update(updatedoc);
class UpdateResponse {
index: products
updated: null
id: 1
result: updated
}
res = await indexApi.update({ index: 'test', id: 1, doc: { cat: 10 } });
{
"_index":"test",
"_id":1,
"result":"updated"
}
updateDoc = map[string]interface{} {"cat":10}
updateRequest = manticoreclient.NewUpdateDocumentRequest("test", updateDoc)
updateRequest.SetId(1)
res, _, _ = apiClient.IndexAPI.Update(context.Background()).UpdateDocumentRequest(*updateRequest).Execute()
{
"_index":"test",
"_id":1,
"result":"updated"
}
OPTION clause is a Manticore-specific extension that lets you control a number of per-update options. The syntax is:
OPTION <optionname>=<value> [ , ... ]
The options are the same as for the SELECT statement. Specifically for the UPDATE statement, you can use these options:
strict: the UPDATE will result in an error if the query tries to perform an update on non-numeric properties. With strict=0, if multiple properties are updated and some of them are not allowed, the UPDATE will not result in an error and will perform the changes only on the allowed properties (the rest are ignored). If none of the SET changes of the UPDATE are permitted, the command will result in an error even with strict=0.
In rare cases, Manticore's built-in query analyzer may misinterpret a query and fail to determine correctly whether a table by ID should be used. This can result in poor performance for queries like UPDATE ... WHERE id = 123.
For information on how to force the optimizer to use a docid index, see Query optimizer hints.
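For illustration, here is a minimal sketch combining these pieces, based on the UPDATE syntax above (the table and column names are just examples):
UPDATE products SET price=1000 WHERE id=123 OPTION strict=0;
UPDATE products SET price=1000 WHERE id=123 FORCE INDEX(id);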
Updates using HTTP JSON protocol are performed via the /update endpoint. The syntax is similar to the /insert endpoint, but this time the doc property is mandatory.
The server will respond with a JSON object stating if the operation was successful or not.
POST /update
{
"index":"test",
"id":1,
"doc":
{
"gid" : 100,
"price" : 1000
}
}
{
"_index": "test",
"_id": 1,
"result": "updated"
}
The ID of the document that needs to be updated can be set directly using the id property, as shown in the previous example, or you can update documents by query and apply the update to all the documents that match the query:
POST /update
{
"index":"test",
"doc":
{
"price" : 1000
},
"query":
{
"match": { "*": "apple" }
}
}
{
"_index":"products",
"updated":1
}
The query syntax is the same as in the /search endpoint. Note that you can't specify id and query at the same time.
FLUSH ATTRIBUTES
The FLUSH ATTRIBUTES command flushes all in-memory attribute updates in all the active tables to disk. It returns a tag that identifies the result on-disk state, which represents the number of actual disk attribute saves performed since the server startup.
mysql> UPDATE testindex SET channel_id=1107025 WHERE id=1;
Query OK, 1 row affected (0.04 sec)
mysql> FLUSH ATTRIBUTES;
+------+
| tag |
+------+
| 1 |
+------+
1 row in set (0.19 sec)
See also attr_flush_period setting.
You can perform multiple update operations in a single call using the /bulk endpoint. This endpoint only works with data that has Content-Type set to application/x-ndjson. The data should be formatted as newline-delimited JSON (NDJSON). Essentially, this means that each line should contain exactly one JSON statement and end with a newline \n and, possibly, a \r.
POST /bulk
{ "update" : { "index" : "products", "id" : 1, "doc": { "price" : 10 } } }
{ "update" : { "index" : "products", "id" : 2, "doc": { "price" : 20 } } }
{
"items":
[
{
"update":
{
"_index":"products",
"_id":1,
"result":"updated"
}
},
{
"update":
{
"_index":"products",
"_id":2,
"result":"updated"
}
}
],
"errors":false
}
The /bulk endpoint supports inserts, replaces, and deletes. Each statement begins with an action type (in this case, update). Here's a list of the supported actions:
insert: Inserts a document. The syntax is the same as in the /insert endpoint.
create: a synonym for insert.
replace: Replaces a document. The syntax is the same as in the /replace endpoint.
index: a synonym for replace.
update: Updates a document. The syntax is the same as in the /update endpoint.
delete: Deletes a document. The syntax is the same as in the /delete endpoint.
Updates by query and deletes by query are also supported.
POST /bulk
{ "update" : { "index" : "products", "doc": { "coeff" : 1000 }, "query": { "range": { "price": { "gte": 1000 } } } } }
{ "update" : { "index" : "products", "doc": { "coeff" : 0 }, "query": { "range": { "price": { "lt": 1000 } } } } }
{
"items":
[
{
"update":
{
"_index":"products",
"updated":1
}
},
{
"update":
{
"_index":"products",
"updated":3
}
}
],
"errors":false
}
$client->bulk([
['update'=>[
'index' => 'products',
'doc' => [
'coeff' => 100
],
'query' => [
'range' => ['price'=>['gte'=>1000]]
]
]
],
['update'=>[
'index' => 'products',
'doc' => [
'coeff' => 0
],
'query' => [
'range' => ['price'=>['lt'=>1000]]
]
]
]
]);
Array(
[items] => Array (
Array(
[update] => Array(
[_index] => products
[updated] => 1
)
)
Array(
[update] => Array(
[_index] => products
[updated] => 3
)
)
)
docs = [ \
{ "update" : { "index" : "products", "doc": { "coeff" : 1000 }, "query": { "range": { "price": { "gte": 1000 } } } } }, \
{ "update" : { "index" : "products", "doc": { "coeff" : 0 }, "query": { "range": { "price": { "lt": 1000 } } } } } ]
indexApi.bulk('\n'.join(map(json.dumps,docs)))
{'error': None,
'items': [{u'update': {u'_index': u'products', u'updated': 1}},
{u'update': {u'_index': u'products', u'updated': 3}}]}
docs = [
{ "update" : { "index" : "products", "doc": { "coeff" : 1000 }, "query": { "range": { "price": { "gte": 1000 } } } } },
{ "update" : { "index" : "products", "doc": { "coeff" : 0 }, "query": { "range": { "price": { "lt": 1000 } } } } } ];
res = await indexApi.bulk(docs.map(e=>JSON.stringify(e)).join('\n'));
{"items":[{"update":{"_index":"products","updated":1}},{"update":{"_index":"products","updated":3}}],"errors":false}
String body = "{ \"update\" : { \"index\" : \"products\", \"doc\": { \"coeff\" : 1000 }, \"query\": { \"range\": { \"price\": { \"gte\": 1000 } } } }} "+"\n"+
"{ \"update\" : { \"index\" : \"products\", \"doc\": { \"coeff\" : 0 }, \"query\": { \"range\": { \"price\": { \"lt\": 1000 } } } } }"+"\n";
indexApi.bulk(body);
class BulkResponse {
items: [{update={_index=products, _id=1, created=false, result=updated, status=200}}, {update={_index=products, _id=2, created=false, result=updated, status=200}}]
error: null
additionalProperties: {errors=false}
}
string body = "{ \"update\" : { \"index\" : \"products\", \"doc\": { \"coeff\" : 1000 }, \"query\": { \"range\": { \"price\": { \"gte\": 1000 } } } }} "+"\n"+
"{ \"update\" : { \"index\" : \"products\", \"doc\": { \"coeff\" : 0 }, \"query\": { \"range\": { \"price\": { \"lt\": 1000 } } } } }"+"\n";
indexApi.Bulk(body);
class BulkResponse {
items: [{update={_index=products, _id=1, created=false, result=updated, status=200}}, {update={_index=products, _id=2, created=false, result=updated, status=200}}]
error: null
additionalProperties: {errors=false}
}
updateDocs = [
{
update: {
index: 'test',
id: 1,
doc: { content: 'Text 11', cat: 1, name: 'Doc 11' },
},
},
{
update: {
index: 'test',
id: 2,
doc: { content: 'Text 22', cat: 9, name: 'Doc 22' },
},
},
];
res = await indexApi.bulk(
updateDocs.map((e) => JSON.stringify(e)).join("\n")
);
{
"items":
[
{
"update":
{
"_index":"test",
"updated":1
}
},
{
"update":
{
"_index":"test",
"updated":1
}
}
],
"errors":false
}
body := "{\"update\": {\"index\": \"test\", \"id\": 1, \"doc\": {\"content\": \"Text 11\", \"name\": \"Doc 11\", \"cat\": 1 }}}" + "\n" +
"{\"update\": {\"index\": \"test\", \"id\": 2, \"doc\": {\"content\": \"Text 22\", \"name\": \"Doc 22\", \"cat\": 9 }}}" +"\n";
res, _, _ := apiClient.IndexAPI.Bulk(context.Background()).Body(body).Execute()
{
"items":
[
{
"update":
{
"_index":"test",
"updated":1
}
},
{
"update":
{
"_index":"test",
"updated":1
}
}
],
"errors":false
}
Keep in mind that the bulk operation stops at the first query that results in an error.
attr_update_reserve=size
attr_update_reserve is a per-table setting that determines the space reserved for blob attribute updates. This setting is optional, with a default value of 128k.
When blob attributes (MVAs, strings, JSON) are updated, their length may change. If the updated string (or MVA, or JSON) is shorter than the old one, it overwrites the old one in the .spb file. However, if the updated string is longer, updates are written to the end of the .spb file. This file is memory-mapped, which means resizing it may be a rather slow process, depending on the OS implementation of memory-mapped files.
To avoid frequent resizes, you can specify the extra space to be reserved at the end of the .spb file using this option.
create table products(title text, price float) attr_update_reserve = '1M'
POST /cli -d "
create table products(title text, price float) attr_update_reserve = '1M'"
$params = [
'body' => [
'settings' => [
'attr_update_reserve' => '1M'
],
'columns' => [
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
]
],
'index' => 'products'
];
$index = new \Manticoresearch\Index($client);
$index->create($params);
utilsApi.sql('create table products(title text, price float) attr_update_reserve = \'1M\'')
res = await utilsApi.sql('create table products(title text, price float) attr_update_reserve = \'1M\'');
utilsApi.sql("create table products(title text, price float) attr_update_reserve = '1M'");
utilsApi.Sql("create table products(title text, price float) attr_update_reserve = '1M'");
utilsApi.sql("create table test(content text, name string, cat int) attr_update_reserve = '1M'");
apiClient.UtilsAPI.Sql(context.Background()).Body("create table test(content text, name string, cat int) attr_update_reserve = '1M'").Execute()
table products {
attr_update_reserve = 1M
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
attr_flush_period = 900 # persist updates to disk every 15 minutes
When updating attributes, the changes are first written to an in-memory copy of the attributes. This setting allows you to set the interval between flushing the updates to disk. It defaults to 0, which disables periodic flushing, but flushing will still occur at normal shutdown.
Deleting documents is only supported in RT mode for the following table types:
You can delete existing documents from a table based on either their ID or certain conditions.
Also, bulk deletion is available to delete multiple documents.
Deletion of documents can be accomplished via both SQL and JSON interfaces.
For SQL, the response for a successful operation will indicate the number of rows deleted.
For JSON, the json/delete endpoint is used. The server will respond with a JSON object indicating whether the operation was successful and the number of rows deleted.
It is recommended to use table truncation instead of deletion to delete all documents from a table, as it is a much faster operation.
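A minimal example of truncation (the table name is illustrative):
TRUNCATE TABLE test;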
In this example we delete all documents that match full-text query test document from the table named test:
mysql> SELECT * FROM TEST;
+------+------+-------------+------+
| id | gid | mva1 | mva2 |
+------+------+-------------+------+
| 100 | 1000 | 100,201 | 100 |
| 101 | 1001 | 101,202 | 101 |
| 102 | 1002 | 102,203 | 102 |
| 103 | 1003 | 103,204 | 103 |
| 104 | 1004 | 104,204,205 | 104 |
| 105 | 1005 | 105,206 | 105 |
| 106 | 1006 | 106,207 | 106 |
| 107 | 1007 | 107,208 | 107 |
+------+------+-------------+------+
8 rows in set (0.00 sec)
mysql> DELETE FROM TEST WHERE MATCH ('test document');
Query OK, 2 rows affected (0.00 sec)
mysql> SELECT * FROM TEST;
+------+------+-------------+------+
| id | gid | mva1 | mva2 |
+------+------+-------------+------+
| 100 | 1000 | 100,201 | 100 |
| 101 | 1001 | 101,202 | 101 |
| 102 | 1002 | 102,203 | 102 |
| 103 | 1003 | 103,204 | 103 |
| 104 | 1004 | 104,204,205 | 104 |
| 105 | 1005 | 105,206 | 105 |
+------+------+-------------+------+
6 rows in set (0.00 sec)
POST /delete -d '
{
"index":"test",
"query":
{
"match": { "*": "test document" }
}
}'
{
"_index":"test",
"deleted":2,
}
$index->deleteDocuments(new MatchPhrase('test document','*'));
Array(
[_index] => test
[deleted] => 2
)
indexApi.delete({"index" : "test", "query": { "match": { "*": "test document" }}})
{'deleted': 5, 'id': None, 'index': 'test', 'result': None}
res = await indexApi.delete({"index" : "test", "query": { "match": { "*": "test document" }}});
{"_index":"test","deleted":5}
DeleteDocumentRequest deleteRequest = new DeleteDocumentRequest();
query = new HashMap<String,Object>();
query.put("match",new HashMap<String,Object>(){{
put("*","test document");
}});
deleteRequest.index("test").setQuery(query);
indexApi.delete(deleteRequest);
class DeleteResponse {
index: test
deleted: 5
id: null
result: null
}
Dictionary<string, Object> match = new Dictionary<string, Object>();
match.Add("*", "test document");
Dictionary<string, Object> query = new Dictionary<string, Object>();
query.Add("match", match);
DeleteDocumentRequest deleteRequest = new DeleteDocumentRequest(index: "test", query: query);
indexApi.Delete(deleteRequest);
class DeleteResponse {
index: test
deleted: 5
id: null
result: null
}
res = await indexApi.delete({
index: 'test',
query: { match: { '*': 'test document' } },
});
{"_index":"test","deleted":5}
deleteRequest := manticoresearch.NewDeleteDocumentRequest("test")
matchExpr := map[string]interface{} {"*": "test document"}
deleteQuery := map[string]interface{} {"match": matchExpr }
deleteRequest.SetQuery(deleteQuery)
res, _, _ := apiClient.IndexAPI.Delete(context.Background()).DeleteDocumentRequest(*deleteRequest).Execute()
{"_index":"test","deleted":5}
Here, we delete a document with id equal to 1 from the table named test:
mysql> DELETE FROM TEST WHERE id=1;
Query OK, 1 rows affected (0.00 sec)
POST /delete -d '
{
"index": "test",
"id": 1
}'
{
"_index": "test",
"_id": 1,
"found": true,
"result": "deleted"
}
$index->deleteDocument(1);
Array(
[_index] => test
[_id] => 1
[found] => true
[result] => deleted
)
indexApi.delete({"index" : "test", "id" : 1})
{'deleted': None, 'id': 1, 'index': 'test', 'result': 'deleted'}
res = await indexApi.delete({"index" : "test", "id" : 1});
{"_index":"test","_id":1,"result":"deleted"}
DeleteDocumentRequest deleteRequest = new DeleteDocumentRequest();
deleteRequest.index("test").setId(1L);
indexApi.delete(deleteRequest);
class DeleteResponse {
index: test
_id: 1
result: deleted
}
DeleteDocumentRequest deleteRequest = new DeleteDocumentRequest(index: "test", id: 1);
indexApi.Delete(deleteRequest);
class DeleteResponse {
index: test
_id: 1
result: deleted
}
res = await indexApi.delete({ index: 'test', id: 1 });
{"_index":"test","_id":1,"result":"deleted"}
deleteRequest := manticoresearch.NewDeleteDocumentRequest("test")
deleteRequest.SetId(1)
res, _, _ := apiClient.IndexAPI.Delete(context.Background()).DeleteDocumentRequest(*deleteRequest).Execute()
{"_index":"test","_id":1,"result":"deleted"}
Here, documents whose ids match the listed values are deleted from the table named test:
Note that the delete forms with id=N or id IN (X,Y) are the fastest, as they delete documents without performing a search.
Also note that the response contains only the id of the first deleted document in the corresponding _id field.
DELETE FROM TEST WHERE id IN (1,2);
Query OK, 2 rows affected (0.00 sec)
POST /delete -d '
{
"index":"test",
"id": [1,2]
}'
{
"_index":"test",
"_id":1,
"found":true,
"result":"deleted"
}
$index->deleteDocumentsByIds([1,2]);
Array(
[_index] => test
[_id] => 1
[found] => true
[result] => deleted
)
Manticore SQL allows you to use complex conditions in the DELETE statement.
For example, here we delete documents from the table named test that match the full-text query test document and have an mva1 attribute with a value greater than 206, or with mva1 values of 100 or 103:
DELETE FROM TEST WHERE MATCH ('test document') AND ( mva1>206 or mva1 in (100, 103) );
SELECT * FROM TEST;
Query OK, 2 rows affected (0.00 sec)
+------+------+-------------+------+
| id | gid | mva1 | mva2 |
+------+------+-------------+------+
| 101 | 1001 | 101,202 | 101 |
| 102 | 1002 | 102,203 | 102 |
| 104 | 1004 | 104,204,205 | 104 |
| 105 | 1005 | 105,206 | 105 |
+------+------+-------------+------+
4 rows in set (0.00 sec)
Here is an example of deleting documents in the table test of the replication cluster named cluster. Note that we must provide the cluster name property along with the table property to delete a row from a table within a replication cluster:
delete from cluster:test where id=100;
POST /delete -d '
{
"cluster": "cluster",
"index": "test",
"id": 100
}'
$index->setCluster('cluster');
$index->deleteDocument(100);
Array(
[_index] => test
[_id] => 100
[found] => true
[result] => deleted
)
indexApi.delete({"cluster":"cluster","index" : "test", "id" : 1})
{'deleted': None, 'id': 1, 'index': 'test', 'result': 'deleted'}
indexApi.delete({"cluster":"cluster_1","index" : "test", "id" : 1})
{"_index":"test","_id":1,"result":"deleted"}
DeleteDocumentRequest deleteRequest = new DeleteDocumentRequest();
deleteRequest.cluster("cluster").index("test").setId(1L);
indexApi.delete(deleteRequest);
class DeleteResponse {
index: test
_id: 1
result: deleted
}
DeleteDocumentRequest deleteRequest = new DeleteDocumentRequest(index: "test", cluster: "cluster", id: 1);
indexApi.Delete(deleteRequest);
class DeleteResponse {
index: test
_id: 1
result: deleted
}
res = await indexApi.delete({ cluster: 'cluster_1', index: 'test', id: 1 });
{"_index":"test","_id":1,"result":"deleted"}
deleteRequest := manticoresearch.NewDeleteDocumentRequest("test")
deleteRequest.SetCluster("cluster_1")
deleteRequest.SetId(1)
res, _, _ := apiClient.IndexAPI.Delete(context.Background()).DeleteDocumentRequest(*deleteRequest).Execute()
{"_index":"test","_id":1,"result":"deleted"}
You can also perform multiple delete operations in a single call using the /bulk endpoint. This endpoint only works with data that has Content-Type set to application/x-ndjson. The data should be formatted as newline-delimited JSON (NDJSON). Essentially, this means that each line should contain exactly one JSON statement and end with a newline \n and, possibly, a \r.
POST /bulk
{ "delete" : { "index" : "test", "id" : 1 } }
{ "delete" : { "index" : "test", "query": { "equals": { "int_data" : 20 } } } }
{
"items":
[
{
"bulk":
{
"_index":"test",
"_id":0,
"created":0,
"deleted":2,
"updated":0,
"result":"created",
"status":201
}
}
],
"errors":false
}
$client->bulk([
['delete' => [
'index' => 'test',
'id' => 1
]
],
['delete'=>[
'index' => 'test',
'query' => [
'equals' => ['int_data' => 20]
]
]
]
]);
Array(
[items] => Array
(
[0] => Array
(
[bulk] => Array
(
[_index] => test
[_id] => 0
[created] => 0
[deleted] => 2
[updated] => 0
[result] => created
[status] => 201
)
)
)
[current_line] => 3
[skipped_lines] => 0
[errors] =>
[error] =>
)
docs = [ \
{ "delete" : { "index" : "test", "id": 1 } }, \
{ "delete" : { "index" : "test", "query": { "equals": { "int_data": 20 } } } } ]
indexApi.bulk('\n'.join(map(json.dumps,docs)))
{
'error': None,
'items': [{u'delete': {u'_index': u'test', u'deleted': 2}}]
}
docs = [
{ "delete" : { "index" : "test", "id": 1 } },
{ "delete" : { "index" : "test", "query": { "equals": { "int_data": 20 } } } } ];
res = await indexApi.bulk(docs.map(e=>JSON.stringify(e)).join('\n'));
{"items":[{"delete":{"_index":"test","deleted":2}}],"errors":false}
String body = "{ "delete" : { "index" : "test", "id": 1 } } "+"\n"+
"{ "delete" : { "index" : "test", "query": { "equals": { "int_data": 20 } } } }"+"\n";
indexApi.bulk(body);
class BulkResponse {
items: [{delete={_index=test, _id=0, created=false, deleted=2, result=created, status=200}}]
error: null
additionalProperties: {errors=false}
}
string body = "{ "delete" : { "index" : "test", "id": 1 } } "+"\n"+
"{ "delete" : { "index" : "test", "query": { "equals": { "int_data": 20 } } } }"+"\n";
indexApi.Bulk(body);
class BulkResponse {
items: [{replace={_index=test, _id=0, created=false, deleted=2, result=created, status=200}}]
error: null
additionalProperties: {errors=false}
}
docs = [
{ "delete" : { "index" : "test", "id": 1 } },
{ "delete" : { "index" : "test", "query": { "equals": { "int_data": 20 } } } }
];
res = await indexApi.bulk(
docs.map((e) => JSON.stringify(e)).join("\n")
);
{"items":[{"delete":{"_index":"test","deleted":2}}],"errors":false}
docs = []string {
`{ "delete" : { "index" : "test", "id": 1 } }`,
`{ "delete" : { "index" : "test", "query": { "equals": { "int_data": 20 } } } }`
}
body = strings.Join(docs, "\n")
resp, httpRes, err := manticoreclient.IndexAPI.Bulk(context.Background()).Body(body).Execute()
{"items":[{"delete":{"_index":"test","deleted":2}}],"errors":false}
Manticore supports basic transactions for deleting and inserting data into real-time and percolate tables, except when attempting to write to a distributed table which includes a real-time or percolate table. Each change to a table is first saved in an internal changeset and then actually committed to the table. By default, each command is wrapped in an individual automatic transaction, making it transparent: you simply 'insert' something and can see the inserted result after it completes, without worrying about transactions. However, this behavior can be explicitly managed by starting and committing transactions manually.
Transactions are supported for the following commands:
Transactions are not supported for:
Please note that transactions in Manticore do not aim to provide isolation. The purpose of transactions in Manticore is to allow you to accumulate multiple writes and execute them all at once upon commit, or to roll them all back if necessary. Transactions are integrated with binary log for durability and consistency.
SET AUTOCOMMIT = {0 | 1}
SET AUTOCOMMIT controls the autocommit mode in the active session. AUTOCOMMIT is set to 1 by default. With the default setting, you don't have to worry about transactions, as every statement that makes any changes to any table is implicitly wrapped in a separate transaction. Setting it to 0 allows you to manage transactions manually, meaning they will not be visible until you explicitly commit them.
Transactions are limited to a single real-time or percolate table and are also limited in size. They are atomic, consistent, overly isolated, and durable. Overly isolated means that the changes are not only invisible to concurrent transactions but even to the current session itself.
START TRANSACTION | BEGIN
COMMIT
ROLLBACK
The BEGIN statement (or its START TRANSACTION alias) forcibly commits any pending transaction, if present, and starts a new one.
The COMMIT statement commits the current transaction, making all its changes permanent.
The ROLLBACK statement rolls back the current transaction, canceling all its changes.
When using one of the /bulk JSON endpoints ( bulk insert, bulk replace, bulk delete ), you can force a batch of documents to be committed by adding an empty line after them.
insert into indexrt (id, content, title, channel_id, published) values (1, 'aa', 'blabla', 1, 10);
Query OK, 1 rows affected (0.00 sec)
select * from indexrt where id=1;
+------+------------+-----------+--------+
| id | channel_id | published | title |
+------+------------+-----------+--------+
| 1 | 1 | 10 | blabla |
+------+------------+-----------+--------+
1 row in set (0.00 sec)
The inserted value is immediately visible in the following 'select' statement.
set autocommit=0;
Query OK, 0 rows affected (0.00 sec)
insert into indexrt (id, content, title, channel_id, published) values (3, 'aa', 'bb', 1, 1);
Query OK, 1 row affected (0.00 sec)
insert into indexrt (id, content, title, channel_id, published) values (4, 'aa', 'bb', 1, 1);
Query OK, 1 row affected (0.00 sec)
select * from indexrt where id=3;
Empty set (0.01 sec)
select * from indexrt where id=4;
Empty set (0.00 sec)
In this case, changes are NOT automatically committed. As a result, the insertions are not visible, even in the same session, since they have not been committed. Also, despite the absence of a BEGIN statement, a transaction is implicitly started.
To make the changes visible, you need to commit the transaction:
commit;
Query OK, 0 rows affected (0.00 sec)
select * from indexrt where id=4;
+------+------------+-----------+-------+
| id | channel_id | published | title |
+------+------------+-----------+-------+
| 4 | 1 | 1 | bb |
+------+------------+-----------+-------+
1 row in set (0.00 sec)
select * from indexrt where id=3;
+------+------------+-----------+-------+
| id | channel_id | published | title |
+------+------------+-----------+-------+
| 3 | 1 | 1 | bb |
+------+------------+-----------+-------+
1 row in set (0.00 sec)
After the commit statement, the insertions are visible in the table.
By using BEGIN and COMMIT, you can define the bounds of a transaction explicitly, so there's no need to worry about autocommit in this case.
begin;
Query OK, 0 rows affected (0.00 sec)
insert into indexrt (id, content, title, channel_id, published) values (2, 'aa', 'bb', 1, 1);
Query OK, 1 row affected (0.00 sec)
select * from indexrt where id=2;
Empty set (0.01 sec)
commit;
Query OK, 0 rows affected (0.01 sec)
select * from indexrt where id=2;
+------+------------+-----------+-------+
| id | channel_id | published | title |
+------+------------+-----------+-------+
| 2 | 1 | 1 | bb |
+------+------------+-----------+-------+
1 row in set (0.01 sec)
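Similarly, ROLLBACK discards the pending changes instead of committing them. Below is a minimal sketch assuming the same indexrt table as above; since the transaction is rolled back, the subsequent select returns nothing:
begin;
Query OK, 0 rows affected (0.00 sec)
insert into indexrt (id, content, title, channel_id, published) values (5, 'aa', 'bb', 1, 1);
Query OK, 1 row affected (0.00 sec)
rollback;
Query OK, 0 rows affected (0.00 sec)
select * from indexrt where id=5;
Empty set (0.00 sec)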
Searching is a core feature of Manticore Search. You can:
SQL:
SELECT ... [OPTION <optionname>=<value> [ , ... ]]
HTTP:
POST /search
{
"index" : "index_name",
"options":
{
...
}
}
The MATCH clause allows for full-text searches in text fields. The input query string is tokenized using the same settings applied to the text during indexing. In addition to the tokenization of input text, the query string supports a number of full-text operators that enforce various rules on how keywords should provide a valid match.
Full-text match clauses can be combined with attribute filters as an AND boolean. OR relations between full-text matches and attribute filters are not supported.
The match query is always executed first in the filtering process, followed by the attribute filters. The attribute filters are applied to the result set of the match query. A query without a match clause is called a fullscan.
There must be at most one MATCH() in the SELECT clause.
Using the full-text query syntax, matching is performed across all indexed text fields of a document, unless the expression requires a match within a field (like phrase search) or is limited by field operators.
SELECT * FROM myindex WHERE MATCH('cats|birds');
The SELECT statement uses a MATCH clause, which must come after WHERE, for performing full-text searches. MATCH() accepts an input string in which all full-text operators are available.
SELECT * FROM myindex WHERE MATCH('"find me fast"/2');
+------+------+----------------+
| id | gid | title |
+------+------+----------------+
| 1 | 11 | first find me |
| 2 | 12 | second find me |
+------+------+----------------+
2 rows in set (0.00 sec)
SELECT * FROM myindex WHERE MATCH('cats|birds') AND (`title`='some title' AND `id`=123);
Full-text matching is available in the /search endpoint and in HTTP-based clients. The following clauses can be used for performing full-text matches:
"match" is a simple query that matches the specified keywords in the specified fields.
"query":
{
"match": { "field": "keyword" }
}
You can specify a list of fields:
"match":
{
"field1,field2": "keyword"
}
Or you can use _all or * to search all fields.
You can search all fields except one using "!field":
"match":
{
"!field1": "keyword"
}
By default, keywords are combined using the OR operator. However, you can change that behavior using the "operator" clause:
"query":
{
"match":
{
"content,title":
{
"query":"keyword",
"operator":"or"
}
}
}
"operator" can be set to "or" or "and".
"match_phrase" is a query that matches the entire phrase. It is similar to a phrase operator in SQL. Here's an example:
"query":
{
"match_phrase": { "_all" : "had grown quite" }
}
"query_string" accepts an input string as a full-text query in MATCH() syntax.
"query":
{
"query_string": "Church NOTNEAR/3 street"
}
"match_all" accepts an empty object and returns documents from the table without performing any attribute filtering or full-text matching. Alternatively, you can just omit the query clause in the request which will have the same effect.
"query":
{
"match_all": {}
}
All full-text match clauses can be combined with must, must_not, and should operators of a JSON bool query.
Examples:
// POST /search -d
{
"index" : "hn_small",
"query":
{
"match":
{
"*" : "find joe"
}
},
"_source": ["story_author","comment_author"],
"limit": 1
}
{
"took" : 3,
"timed_out" : false,
"hits" : {
"hits" : [
{
"_id" : "668018",
"_score" : 3579,
"_source" : {
"story_author" : "IgorPartola",
"comment_author" : "joe_the_user"
}
}
],
"total" : 88063,
"total_relation" : "eq"
}
}
POST /search
-d
'{
"index" : "hn_small",
"query":
{
"match_phrase":
{
"*" : "find joe"
}
},
"_source": ["story_author","comment_author"],
"limit": 1
}'
{
"took" : 3,
"timed_out" : false,
"hits" : {
"hits" : [
{
"_id" : "807160",
"_score" : 2599,
"_source" : {
"story_author" : "rbanffy",
"comment_author" : "runjake"
}
}
],
"total" : 2,
"total_relation" : "eq"
}
}
POST /search
-d
'{ "index" : "hn_small",
"query":
{
"query_string": "@comment_text \"find joe fast \"/2"
},
"_source": ["story_author","comment_author"],
"limit": 1
}'
{
"took" : 3,
"timed_out" : false,
"hits" : {
"hits" : [
{
"_id" : "807160",
"_score" : 2566,
"_source" : {
"story_author" : "rbanffy",
"comment_author" : "runjake"
}
}
],
"total" : 1864,
"total_relation" : "eq"
}
}
$search = new Search(new Client());
$result = $search->search('@title find me fast')->get();
foreach($result as $doc)
{
echo 'Document: '.$doc->getId();
foreach($doc->getData() as $field=>$value)
{
echo $field.': '.$value;
}
}
Document: 1
title: first find me fast
gid: 11
Document: 2
title: second find me fast
gid: 12
searchApi.search({"index":"hn_small","query":{"query_string":"@comment_text \"find joe fast \"/2"}, "_source": ["story_author","comment_author"], "limit":1})
{'aggregations': None,
'hits': {'hits': [{'_id': '807160',
'_score': 2566,
'_source': {'comment_author': 'runjake',
'story_author': 'rbanffy'}}],
'max_score': None,
'total': 1864,
'total_relation': 'eq'},
'profile': None,
'timed_out': False,
'took': 2,
'warning': None}
res = await searchApi.search({"index":"hn_small","query":{"query_string":"@comment_text \"find joe fast \"/2"}, "_source": ["story_author","comment_author"], "limit":1});
{
took: 1,
timed_out: false,
hits: {
exports: {
total: 1864,
total_relation: 'eq',
hits: [
{
_id: '807160',
_score: 2566,
_source: { story_author: 'rbanffy', comment_author: 'runjake' }
}
]
}
}
}
query = new HashMap<String,Object>();
query.put("query_string", "@comment_text \"find joe fast \"/2");
searchRequest = new SearchRequest();
searchRequest.setIndex("hn_small");
searchRequest.setQuery(query);
searchRequest.addSourceItem("story_author");
searchRequest.addSourceItem("comment_author");
searchRequest.limit(1);
searchResponse = searchApi.search(searchRequest);
class SearchResponse {
took: 1
timedOut: false
aggregations: null
hits: class SearchResponseHits {
maxScore: null
total: 1864
totalRelation: eq
hits: [{_id=807160, _score=2566, _source={story_author=rbanffy, comment_author=runjake}}]
}
profile: null
warning: null
}
object query = new { query_string="@comment_text \"find joe fast \"/2" };
var searchRequest = new SearchRequest("hn_small", query);
searchRequest.Source = new List<string> {"story_author", "comment_author"};
searchRequest.Limit = 1;
SearchResponse searchResponse = searchApi.Search(searchRequest);
class SearchResponse {
took: 1
timedOut: false
aggregations: null
hits: class SearchResponseHits {
maxScore: null
total: 1864
totalRelation: eq
hits: [{_id=807160, _score=2566, _source={story_author=rbanffy, comment_author=runjake}}]
}
profile: null
warning: null
}
res = await searchApi.search({
index: 'test',
query: { query_string: "test document 1" },
"_source": ["content", "title"],
limit: 1
});
{
took: 1,
timed_out: false,
hits:
exports {
total: 5,
total_relation: 'eq',
hits:
[ { _id: '1',
_score: 2566,
_source: { content: 'This is a test document 1', title: 'Doc 1' }
}
]
}
}
searchRequest := manticoresearch.NewSearchRequest("test")
query := map[string]interface{} {"query_string": "test document 1"}
searchRequest.SetQuery(query)
searchRequest.SetSource([]string{"content", "title"})
searchRequest.SetLimit(1)
resp, httpRes, err := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()
{
"hits": {
"hits": [
{
"_id": "1",
"_score": 2566,
"_source": {
"content": "This is a test document 1",
"title": "Doc 1"
}
}
],
"total": 5,
"total_relation": "eq"
},
"timed_out": false,
"took": 0
}
The query string can include specific operators that define the conditions for how the words from the query string should be matched.
An implicit logical AND operator is always present, so "hello world" implies that both "hello" and "world" must be found in the matching document.
hello world
Note: There is no explicit AND operator.
The logical OR operator | has a higher precedence than AND, so looking for cat | dog | mouse means looking for (cat | dog | mouse) rather than (looking for cat) | dog | mouse.
hello | world
Note: There is no operator OR. Please use | instead.
hello MAYBE world
The MAYBE operator functions similarly to the | operator, but it does not return documents that match only the right subtree expression.
hello -world
hello !world
The negation operator enforces a rule for a word to not exist.
Queries containing only negations are not supported by default. To enable, use the server option not_terms_only_allowed.
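For instance, a query consisting of nothing but a negation, like the one below, is rejected by default and only works when not_terms_only_allowed is enabled on the server (the table name is illustrative):
SELECT * FROM myindex WHERE MATCH('-cats');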
@title hello @body world
The field limit operator restricts subsequent searches to a specified field. By default, the query will fail with an error message if the given field name does not exist in the searched table. However, this behavior can be suppressed by specifying the @@relaxed option at the beginning of the query:
@@relaxed @nosuchfield my query
This can be useful when searching through heterogeneous tables with different schemas.
Field position limits additionally constrain the search to the first N positions within a given field (or fields). For example, @body [50] hello will not match documents where the keyword hello appears at position 51 or later in the body.
@body[50] hello
Multiple-field search operator:
@(title,body) hello world
Ignore field search operator (ignores any matches of 'hello world' from the 'title' field):
@!title hello world
Ignore multiple-field search operator (if there are fields 'title', 'subject', and 'body', then @!(title) is equivalent to @(subject,body)):
@!(title,body) hello world
All-field search operator:
@* hello
"hello world"
The phrase operator mandates that the words be adjacent to each other.
The phrase search operator can incorporate a match any term modifier. Within the phrase operator, terms are positionally significant. When the 'match any term' modifier is employed, the positions of the subsequent terms in that phrase query will be shifted. As a result, the 'match any' modifier does not affect search performance.
"exact * phrase * * for terms"
"hello world"~10
Proximity distance is measured in words, accounting for word count, and applies to all words within quotes. For example, the query "cat dog mouse"~5 indicates that there must be a span of fewer than 8 words containing all 3 words. Therefore, a document with CAT aaa bbb ccc DOG eee fff MOUSE will not match this query, as the span is exactly 8 words long.
"the world is a wonderful place"/3
The quorum matching operator introduces a type of fuzzy matching. It will match only those documents that meet a given threshold of specified words. In the example above ("the world is a wonderful place"/3), it will match all documents containing at least 3 of the 6 specified words. The operator is limited to 255 keywords. Instead of an absolute number, you can also provide a value between 0.0 and 1.0 (representing 0% and 100%), and Manticore will match only documents containing at least the specified percentage of given words. The same example above could also be expressed as "the world is a wonderful place"/0.5, and it would match documents with at least 50% of the 6 words.
aaa << bbb << ccc
The strict order operator (also known as the "before" operator) matches a document only if its argument keywords appear in the document precisely in the order specified in the query. For example, the query black << cat will match the document "black and white cat" but not the document "that cat was black". The order operator has the lowest priority. It can be applied to both individual keywords and more complex expressions. For instance, this is a valid query:
(bag of words) << "exact phrase" << red|green|blue
raining =cats and =dogs
="exact phrase"
The exact form keyword modifier matches a document only if the keyword appears in the exact form specified. By default, a document is considered a match if the stemmed/lemmatized keyword matches. For instance, the query "runs" will match both a document containing "runs" and one containing "running", because both forms stem to just "run". However, the =runs query will only match the first document. The exact form operator requires the index_exact_words option to be enabled.
Another use case is to prevent expanding a keyword to its *keyword* form. For example, with index_exact_words=1 + expand_keywords=1/star, bcd will find a document containing abcde, but =bcd will not.
As a modifier affecting the keyword, it can be used within operators such as phrase, proximity, and quorum operators. Applying an exact form modifier to the phrase operator is possible, and in this case, it internally adds the exact form modifier to all terms in the phrase.
nation* *nation* *national
Requires min_infix_len for prefix (expansion in trail) and/or suffix (expansion in head). If only prefixing is desired, min_prefix_len can be used instead.
The search will attempt to find all expansions of the wildcarded tokens, and each expansion is recorded as a matched hit. The number of expansions for a token can be controlled with the expansion_limit table setting. Wildcarded tokens can have a significant impact on query search time, especially when tokens have short lengths. In such cases, it is desirable to use the expansion limit.
The wildcard operator can be automatically applied if the expand_keywords table setting is used.
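As a rough sketch, enabling infix indexing at table creation time is enough for such wildcard queries to work (the table name and value are only an illustration):
CREATE TABLE myindex(title text) min_infix_len='2';
SELECT * FROM myindex WHERE MATCH('nation*');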
In addition, the following inline wildcard operators are supported:
? can match any single character: t?st will match test, but not teast
% can match zero or one character: tes% will match tes or test, but not testing
The inline operators require dict=keywords and infixing enabled.
REGEX(/t.?e/)
Requires the min_infix_len or min_prefix_len and dict=keywords options to be set (which is a default).
Similarly to the wildcard operators, the REGEX operator attempts to find all tokens matching the provided pattern, and each expansion is recorded as a matched hit. Note, this can have a significant impact on query search time, as the entire dictionary is scanned, and every term in the dictionary undergoes matching with the REGEX pattern.
The patterns should adhere to the RE2 syntax. The REGEX expression delimiter is the first symbol after the opening bracket. In other words, everything between the opening bracket followed by the delimiter and the delimiter followed by the closing bracket is treated as a RE2 expression.
Please note that the terms stored in the dictionary undergo charset_table transformation, meaning that for example, REGEX may not be able to match uppercase characters if all characters are lowercased according to the charset_table (which happens by default). To successfully match a term using a REGEX expression, the pattern must correspond to the entire token. To achieve partial matching, place .* at the beginning and/or end of your pattern.
REGEX(/.{3}t/)
REGEX(/t.*\d*/)
^hello world$
Field-start and field-end keyword modifiers ensure that a keyword only matches if it appears at the very beginning or the very end of a full-text field, respectively. For example, the query "^hello world$" (enclosed in quotes to combine the phrase operator with the start/end modifiers) will exclusively match documents containing at least one field with these two specific keywords.
boosted^1.234 boostedfieldend$^1.234
The boost modifier raises the word IDF score by the indicated factor in ranking scores that incorporate IDF into their calculations. It does not impact the matching process in any manner.
hello NEAR/3 world NEAR/4 "my test"
The NEAR operator is a more generalized version of the proximity operator. Its syntax is NEAR/N, which is case-sensitive and does not allow spaces between the NEAR keywords, slash sign, and distance value.
While the original proximity operator works only on sets of keywords, NEAR is more versatile and can accept arbitrary subexpressions as its two arguments. It matches a document when both subexpressions are found within N words of each other, regardless of their order. NEAR is left-associative and shares the same (lowest) precedence as BEFORE.
It is important to note that one NEAR/7 two NEAR/7 three is not exactly equivalent to "one two three"~7. The key difference is that the proximity operator allows up to 6 non-matching words between all three matching words, while the version with NEAR is less restrictive: it permits up to 6 words between one and two, and then up to 6 more between that two-word match and three.
Church NOTNEAR/3 street
The NOTNEAR operator serves as a negative assertion. It matches a document when the left argument is present and either the right argument is absent from the document or the right argument is a specified distance away from the end of the left matched argument. The distance is denoted in words. The syntax is NOTNEAR/N, which is case-sensitive and does not permit spaces between the NOTNEAR keyword, slash sign, and distance value. Both arguments of this operator can be terms or any operators or group of operators.
all SENTENCE words SENTENCE "in one sentence"
"Bill Gates" PARAGRAPH "Steve Jobs"
The SENTENCE and PARAGRAPH operators match a document when both of their arguments are within the same sentence or the same paragraph of text, respectively. These arguments can be keywords, phrases, or instances of the same operator.
The order of the arguments within the sentence or paragraph is irrelevant. These operators function only with tables built with index_sp (sentence and paragraph indexing feature) enabled and revert to a simple AND operation otherwise. For information on what constitutes a sentence and a paragraph, refer to the index_sp directive documentation.
ZONE:(h3,h4)
only in these titles
The ZONE limit operator closely resembles the field limit operator but limits matching to a specified in-field zone or a list of zones. It is important to note that subsequent subexpressions do not need to match within a single continuous span of a given zone and may match across multiple spans. For example, the query (ZONE:th hello world) will match the following sample document:
<th>Table 1. Local awareness of Hello Kitty brand.</th>
.. some table data goes here ..
<th>Table 2. World-wide brand awareness.</th>
The ZONE operator influences the query until the next field or ZONE limit operator, or until the closing parenthesis. It functions exclusively with tables built with zone support (refer to index_zones) and will be disregarded otherwise.
ZONESPAN:(h2)
only in a (single) title
The ZONESPAN limit operator resembles the ZONE operator but mandates that the match occurs within a single continuous span. In the example provided earlier, ZONESPAN:th hello world would not match the document, as "hello" and "world" do not appear within the same span.
Since certain characters function as operators in the query string, they must be escaped to prevent query errors or unintended matching conditions.
The following characters should be escaped using a backslash (\):
! " $ ' ( ) - / < @ \ ^ | ~
To escape a single quote ('), use one backslash:
SELECT * FROM your_index WHERE MATCH('l\'italiano');
For the other characters in the list mentioned earlier, which are operators or query constructs, they must be treated as simple characters by the engine, with a preceding escape character.
The backslash must also be escaped, resulting in two backslashes:
SELECT * FROM your_index WHERE MATCH('r\\&b | \\(official video\\)');
To use a backslash as a character, you must escape both the backslash as a character and the backslash as the escape operator, which requires four backslashes:
SELECT * FROM your_index WHERE MATCH('\\\\ABC');
When you are working with JSON data in Manticore Search and need to include a double quote (") within a JSON string, it's important to handle it with proper escaping. In JSON, a double quote within a string is escaped using a backslash (\). However, when inserting the JSON data through an SQL query, Manticore Search interprets the backslash (\) as an escape character within strings.
To ensure the double quote is correctly inserted into the JSON data, you need to escape the backslash itself. This results in using two backslashes (\\) before the double quote. For example:
insert into tbl(j) values('{"a": "\\"abc\\""}');
MySQL drivers provide escaping functions (e.g., mysqli_real_escape_string in PHP or conn.escape_string in Python), but they only escape specific characters.
You will still need to add escaping for the characters from the previously mentioned list that are not escaped by their respective functions.
Because these functions will escape the backslash for you, you only need to add one backslash.
This also applies to drivers that support (client-side) prepared statements. For example, with PHP PDO prepared statements, you need to add a backslash for the $ character:
$statement = $ln_sph->prepare( "SELECT * FROM index WHERE MATCH(:match)");
$match = '\$manticore';
$statement->bindParam(':match',$match,PDO::PARAM_STR);
$results = $statement->execute();
This results in the final query SELECT * FROM index WHERE MATCH('\\$manticore');
The same rules for the SQL protocol apply, with the exception that for JSON, the double quote must be escaped with a single backslash, while the rest of the characters require double escaping.
When using JSON libraries or functions that convert data structures to JSON strings, the double quote and single backslash are automatically escaped by these functions and do not need to be explicitly escaped.
The new official clients (which use the HTTP protocol) utilize common JSON libraries/functions available in their respective programming languages under the hood. The same rules for escaping mentioned earlier apply.
The asterisk (*) is a unique character that serves two purposes:
Unlike other special characters that function as operators, the asterisk cannot be escaped when it's in a position to provide one of its functionalities.
In non-wildcard queries, the asterisk does not require escaping, whether it's in the charset_table or not.
In wildcard queries, an asterisk in the middle of a word does not require escaping. As a wildcard operator (either at the beginning or end of the word), the asterisk will always be interpreted as the wildcard operator, even if escaping is applied.
To escape special characters in JSON nodes, use a backtick. For example:
MySQL [(none)]> select * from t where json.`a=b`=234;
+---------------------+-------------+------+
| id | json | text |
+---------------------+-------------+------+
| 8215557549554925578 | {"a=b":234} | |
+---------------------+-------------+------+
MySQL [(none)]> select * from t where json.`a:b`=123;
+---------------------+-------------+------+
| id | json | text |
+---------------------+-------------+------+
| 8215557549554925577 | {"a:b":123} | |
+---------------------+-------------+------+
Consider this complex query example:
"hello world" @title "example program"~5 @body python -(php|perl) @* code
The full meaning of this search is:
The OR operator takes precedence over AND, so "looking for cat | dog | mouse" means "looking for (cat | dog | mouse)" rather than "(looking for cat) | dog | mouse".
To comprehend how a query will be executed, Manticore Search provides query profiling tools to examine the query tree generated by a query expression.
To enable full-text query profiling with an SQL statement, you must activate it before executing the desired query:
SET profiling =1;
SELECT * FROM test WHERE MATCH('@title abc* @body hey');
To view the query tree, execute the SHOW PLAN command immediately after running the query:
SHOW PLAN;
This command will return the structure of the executed query. Keep in mind that the 3 statements - SET profiling, the query, and SHOW - must be executed within the same session.
When using the HTTP JSON protocol, you can simply set "profile":true to get the full-text query tree structure in the response.
{
"index":"test",
"profile":true,
"query":
{
"match_phrase": { "_all" : "had grown quite" }
}
}
The response will include a profile object containing a query member.
The query property holds the transformed full-text query tree. Each node consists of:
type: node type, which can be AND, OR, PHRASE, KEYWORD, etc.
description: query subtree for this node represented as a string (in SHOW PLAN format)
children: any child nodes, if present
max_field_pos: maximum position within a field
A keyword node will additionally include:
word: the transformed keyword
querypos: position of this keyword in the query
excluded: keyword excluded from the query
expanded: keyword added by prefix expansion
field_start: keyword must appear at the beginning of the field
field_end: keyword must appear at the end of the field
boost: the keyword's IDF will be multiplied by this value
SET profiling=1;
SELECT * FROM test WHERE MATCH('@title abc* @body hey');
SHOW PLAN \G
*************************** 1. row ***************************
Variable: transformed_tree
Value: AND(
OR(fields=(title), KEYWORD(abcx, querypos=1, expanded), KEYWORD(abcm, querypos=1, expanded)),
AND(fields=(body), KEYWORD(hey, querypos=2)))
1 row in set (0.00 sec)
POST /search
{
"index": "forum",
"query": {"query_string": "i me"},
"_source": { "excludes":["*"] },
"limit": 1,
"profile":true
}
{
"took":1503,
"timed_out":false,
"hits":
{
"total":406301,
"hits":
[
{
"_id":"406443",
"_score":3493,
"_source":{}
}
]
},
"profile":
{
"query":
{
"type":"AND",
"description":"AND( AND(KEYWORD(i, querypos=1)), AND(KEYWORD(me, querypos=2)))",
"children":
[
{
"type":"AND",
"description":"AND(KEYWORD(i, querypos=1))",
"children":
[
{
"type":"KEYWORD",
"word":"i",
"querypos":1
}
]
},
{
"type":"AND",
"description":"AND(KEYWORD(me, querypos=2))",
"children":
[
{
"type":"KEYWORD",
"word":"me",
"querypos":2
}
]
}
]
}
}
}
$result = $index->search('i me')->setSource(['excludes'=>['*']])->setLimit(1)->profile()->get();
print_r($result->getProfile());
Array
(
[query] => Array
(
[type] => AND
[description] => AND( AND(KEYWORD(i, querypos=1)), AND(KEYWORD(me, querypos=2)))
[children] => Array
(
[0] => Array
(
[type] => AND
[description] => AND(KEYWORD(i, querypos=1))
[children] => Array
(
[0] => Array
(
[type] => KEYWORD
[word] => i
[querypos] => 1
)
)
)
[1] => Array
(
[type] => AND
[description] => AND(KEYWORD(me, querypos=2))
[children] => Array
(
[0] => Array
(
[type] => KEYWORD
[word] => me
[querypos] => 2
)
)
)
)
)
)
searchApi.search({"index":"forum","query":{"query_string":"i me"},"_source":{"excludes":["*"]},"limit":1,"profile":True})
{'hits': {'hits': [{u'_id': u'100', u'_score': 2500, u'_source': {}}],
'total': 1},
'profile': {u'query': {u'children': [{u'children': [{u'querypos': 1,
u'type': u'KEYWORD',
u'word': u'i'}],
u'description': u'AND(KEYWORD(i, querypos=1))',
u'type': u'AND'},
{u'children': [{u'querypos': 2,
u'type': u'KEYWORD',
u'word': u'me'}],
u'description': u'AND(KEYWORD(me, querypos=2))',
u'type': u'AND'}],
u'description': u'AND( AND(KEYWORD(i, querypos=1)), AND(KEYWORD(me, querypos=2)))',
u'type': u'AND'}},
'timed_out': False,
'took': 0}
res = await searchApi.search({"index":"forum","query":{"query_string":"i me"},"_source":{"excludes":["*"]},"limit":1,"profile":true});
{"hits": {"hits": [{"_id": "100", "_score": 2500, "_source": {}}],
"total": 1},
"profile": {"query": {"children": [{"children": [{"querypos": 1,
"type": "KEYWORD",
"word": "i"}],
"description": "AND(KEYWORD(i, querypos=1))",
"type": "AND"},
{"children": [{"querypos": 2,
"type": "KEYWORD",
"word": "me"}],
"description": "AND(KEYWORD(me, querypos=2))",
"type": "AND"}],
"description": "AND( AND(KEYWORD(i, querypos=1)), AND(KEYWORD(me, querypos=2)))",
"type": "AND"}},
"timed_out": False,
"took": 0}
query = new HashMap<String,Object>();
query.put("query_string","i me");
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
searchRequest.setQuery(query);
searchRequest.setProfile(true);
searchRequest.setLimit(1);
searchRequest.setSort(new ArrayList<String>(){{
add("*");
}});
searchResponse = searchApi.search(searchRequest);
class SearchResponse {
took: 18
timedOut: false
hits: class SearchResponseHits {
total: 1
hits: [{_id=100, _score=2500, _source={}}]
aggregations: null
}
profile: {query={type=AND, description=AND( AND(KEYWORD(i, querypos=1)), AND(KEYWORD(me, querypos=2))), children=[{type=AND, description=AND(KEYWORD(i, querypos=1)), children=[{type=KEYWORD, word=i, querypos=1}]}, {type=AND, description=AND(KEYWORD(me, querypos=2)), children=[{type=KEYWORD, word=me, querypos=2}]}]}}
}
object query = new { query_string="i me" };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Profile = true;
searchRequest.Limit = 1;
searchRequest.Sort = new List<Object> { "*" };
var searchResponse = searchApi.Search(searchRequest);
class SearchResponse {
took: 18
timedOut: false
hits: class SearchResponseHits {
total: 1
hits: [{_id=100, _score=2500, _source={}}]
aggregations: null
}
profile: {query={type=AND, description=AND( AND(KEYWORD(i, querypos=1)), AND(KEYWORD(me, querypos=2))), children=[{type=AND, description=AND(KEYWORD(i, querypos=1)), children=[{type=KEYWORD, word=i, querypos=1}]}, {type=AND, description=AND(KEYWORD(me, querypos=2)), children=[{type=KEYWORD, word=me, querypos=2}]}]}}
}
res = await searchApi.search({
index: 'test',
query: { query_string: 'Text' },
_source: { excludes: ['*'] },
limit: 1,
profile: true
});
{
"hits":
{
"hits":
[{
"_id": "1",
"_score": 1480,
"_source": {}
}],
"total": 1
},
"profile":
{
"query": {
"children":
[{
"children":
[{
"querypos": 1,
"type": "KEYWORD",
"word": "i"
}],
"description": "AND(KEYWORD(i, querypos=1))",
"type": "AND"
},
{
"children":
[{
"querypos": 2,
"type": "KEYWORD",
"word": "me"
}],
"description": "AND(KEYWORD(me, querypos=2))",
"type": "AND"
}],
"description": "AND( AND(KEYWORD(i, querypos=1)), AND(KEYWORD(me, querypos=2)))",
"type": "AND"
}
},
"timed_out": False,
"took": 0
}
searchRequest := manticoresearch.NewSearchRequest("test")
query := map[string]interface{} {"query_string": "Text"}
source := map[string]interface{} { "excludes": []string {"*"} }
searchRequest.SetQuery(query)
searchRequest.SetSource(source)
searchRequest.SetLimit(1)
searchRequest.SetProfile(true)
res, _, _ := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()
{
"hits":
{
"hits":
[{
"_id": "1",
"_score": 1480,
"_source": {}
}],
"total": 1
},
"profile":
{
"query": {
"children":
[{
"children":
[{
"querypos": 1,
"type": "KEYWORD",
"word": "i"
}],
"description": "AND(KEYWORD(i, querypos=1))",
"type": "AND"
},
{
"children":
[{
"querypos": 2,
"type": "KEYWORD",
"word": "me"
}],
"description": "AND(KEYWORD(me, querypos=2))",
"type": "AND"
}],
"description": "AND( AND(KEYWORD(i, querypos=1)), AND(KEYWORD(me, querypos=2)))",
"type": "AND"
}
},
"timed_out": False,
"took": 0
}
In some instances, the evaluated query tree may significantly differ from the original one due to expansions and other transformations.
SET profiling=1;
SELECT id FROM forum WHERE MATCH('@title way* @content hey') LIMIT 1;
SHOW PLAN;
Query OK, 0 rows affected (0.00 sec)
+--------+
| id |
+--------+
| 711651 |
+--------+
1 row in set (0.04 sec)
+------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Variable | Value |
+------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| transformed_tree | AND(
OR(
OR(
AND(fields=(title), KEYWORD(wayne, querypos=1, expanded)),
OR(
AND(fields=(title), KEYWORD(ways, querypos=1, expanded)),
AND(fields=(title), KEYWORD(wayyy, querypos=1, expanded)))),
AND(fields=(title), KEYWORD(way, querypos=1, expanded)),
OR(fields=(title), KEYWORD(way*, querypos=1, expanded))),
AND(fields=(content), KEYWORD(hey, querypos=2))) |
+------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
POST /search
{
"index": "forum",
"query": {"query_string": "@title way* @content hey"},
"_source": { "excludes":["*"] },
"limit": 1,
"profile":true
}
{
"took":33,
"timed_out":false,
"hits":
{
"total":105,
"hits":
[
{
"_id":"711651",
"_score":2539,
"_source":{}
}
]
},
"profile":
{
"query":
{
"type":"AND",
"description":"AND( OR( OR( AND(fields=(title), KEYWORD(wayne, querypos=1, expanded)), OR( AND(fields=(title), KEYWORD(ways, querypos=1, expanded)), AND(fields=(title), KEYWORD(wayyy, querypos=1, expanded)))), AND(fields=(title), KEYWORD(way, querypos=1, expanded)), OR(fields=(title), KEYWORD(way*, querypos=1, expanded))), AND(fields=(content), KEYWORD(hey, querypos=2)))",
"children":
[
{
"type":"OR",
"description":"OR( OR( AND(fields=(title), KEYWORD(wayne, querypos=1, expanded)), OR( AND(fields=(title), KEYWORD(ways, querypos=1, expanded)), AND(fields=(title), KEYWORD(wayyy, querypos=1, expanded)))), AND(fields=(title), KEYWORD(way, querypos=1, expanded)), OR(fields=(title), KEYWORD(way*, querypos=1, expanded)))",
"children":
[
{
"type":"OR",
"description":"OR( AND(fields=(title), KEYWORD(wayne, querypos=1, expanded)), OR( AND(fields=(title), KEYWORD(ways, querypos=1, expanded)), AND(fields=(title), KEYWORD(wayyy, querypos=1, expanded))))",
"children":
[
{
"type":"AND",
"description":"AND(fields=(title), KEYWORD(wayne, querypos=1, expanded))",
"fields":["title"],
"max_field_pos":0,
"children":
[
{
"type":"KEYWORD",
"word":"wayne",
"querypos":1,
"expanded":true
}
]
},
{
"type":"OR",
"description":"OR( AND(fields=(title), KEYWORD(ways, querypos=1, expanded)), AND(fields=(title), KEYWORD(wayyy, querypos=1, expanded)))",
"children":
[
{
"type":"AND",
"description":"AND(fields=(title), KEYWORD(ways, querypos=1, expanded))",
"fields":["title"],
"max_field_pos":0,
"children":
[
{
"type":"KEYWORD",
"word":"ways",
"querypos":1,
"expanded":true
}
]
},
{
"type":"AND",
"description":"AND(fields=(title), KEYWORD(wayyy, querypos=1, expanded))",
"fields":["title"],
"max_field_pos":0,
"children":
[
{
"type":"KEYWORD",
"word":"wayyy",
"querypos":1,
"expanded":true
}
]
}
]
}
]
},
{
"type":"AND",
"description":"AND(fields=(title), KEYWORD(way, querypos=1, expanded))",
"fields":["title"],
"max_field_pos":0,
"children":
[
{
"type":"KEYWORD",
"word":"way",
"querypos":1,
"expanded":true
}
]
},
{
"type":"OR",
"description":"OR(fields=(title), KEYWORD(way*, querypos=1, expanded))",
"fields":["title"],
"max_field_pos":0,
"children":
[
{
"type":"KEYWORD",
"word":"way*",
"querypos":1,
"expanded":true
}
]
}
]
},
{
"type":"AND",
"description":"AND(fields=(content), KEYWORD(hey, querypos=2))",
"fields":["content"],
"max_field_pos":0,
"children":
[
{
"type":"KEYWORD",
"word":"hey",
"querypos":2
}
]
}
]
}
}
}
$result = $index->search('@title way* @content hey')->setSource(['excludes'=>['*']])->setLimit(1)->profile()->get();
print_r($result->getProfile());
Array
(
[query] => Array
(
[type] => AND
[description] => AND( OR( OR( AND(fields=(title), KEYWORD(wayne, querypos=1, expanded)), OR( AND(fields=(title), KEYWORD(ways, querypos=1, expanded)), AND(fields=(title), KEYWORD(wayyy, querypos=1, expanded)))), AND(fields=(title), KEYWORD(way, querypos=1, expanded)), OR(fields=(title), KEYWORD(way*, querypos=1, expanded))), AND(fields=(content), KEYWORD(hey, querypos=2)))
[children] => Array
(
[0] => Array
(
[type] => OR
[description] => OR( OR( AND(fields=(title), KEYWORD(wayne, querypos=1, expanded)), OR( AND(fields=(title), KEYWORD(ways, querypos=1, expanded)), AND(fields=(title), KEYWORD(wayyy, querypos=1, expanded)))), AND(fields=(title), KEYWORD(way, querypos=1, expanded)), OR(fields=(title), KEYWORD(way*, querypos=1, expanded)))
[children] => Array
(
[0] => Array
(
[type] => OR
[description] => OR( AND(fields=(title), KEYWORD(wayne, querypos=1, expanded)), OR( AND(fields=(title), KEYWORD(ways, querypos=1, expanded)), AND(fields=(title), KEYWORD(wayyy, querypos=1, expanded))))
[children] => Array
(
[0] => Array
(
[type] => AND
[description] => AND(fields=(title), KEYWORD(wayne, querypos=1, expanded))
[fields] => Array
(
[0] => title
)
[max_field_pos] => 0
[children] => Array
(
[0] => Array
(
[type] => KEYWORD
[word] => wayne
[querypos] => 1
[expanded] => 1
)
)
)
[1] => Array
(
[type] => OR
[description] => OR( AND(fields=(title), KEYWORD(ways, querypos=1, expanded)), AND(fields=(title), KEYWORD(wayyy, querypos=1, expanded)))
[children] => Array
(
[0] => Array
(
[type] => AND
[description] => AND(fields=(title), KEYWORD(ways, querypos=1, expanded))
[fields] => Array
(
[0] => title
)
[max_field_pos] => 0
[children] => Array
(
[0] => Array
(
[type] => KEYWORD
[word] => ways
[querypos] => 1
[expanded] => 1
)
)
)
[1] => Array
(
[type] => AND
[description] => AND(fields=(title), KEYWORD(wayyy, querypos=1, expanded))
[fields] => Array
(
[0] => title
)
[max_field_pos] => 0
[children] => Array
(
[0] => Array
(
[type] => KEYWORD
[word] => wayyy
[querypos] => 1
[expanded] => 1
)
)
)
)
)
)
)
[1] => Array
(
[type] => AND
[description] => AND(fields=(title), KEYWORD(way, querypos=1, expanded))
[fields] => Array
(
[0] => title
)
[max_field_pos] => 0
[children] => Array
(
[0] => Array
(
[type] => KEYWORD
[word] => way
[querypos] => 1
[expanded] => 1
)
)
)
[2] => Array
(
[type] => OR
[description] => OR(fields=(title), KEYWORD(way*, querypos=1, expanded))
[fields] => Array
(
[0] => title
)
[max_field_pos] => 0
[children] => Array
(
[0] => Array
(
[type] => KEYWORD
[word] => way*
[querypos] => 1
[expanded] => 1
)
)
)
)
)
[1] => Array
(
[type] => AND
[description] => AND(fields=(content), KEYWORD(hey, querypos=2))
[fields] => Array
(
[0] => content
)
[max_field_pos] => 0
[children] => Array
(
[0] => Array
(
[type] => KEYWORD
[word] => hey
[querypos] => 2
)
)
)
)
)
)
searchApi.search({"index":"forum","query":{"query_string":"@title way* @content hey"},"_source":{"excludes":["*"]},"limit":1,"profile":true})
{'hits': {'hits': [{u'_id': u'2811025403043381551',
u'_score': 2643,
u'_source': {}}],
'total': 1},
'profile': {u'query': {u'children': [{u'children': [{u'expanded': True,
u'querypos': 1,
u'type': u'KEYWORD',
u'word': u'way*'}],
u'description': u'AND(fields=(title), KEYWORD(way*, querypos=1, expanded))',
u'fields': [u'title'],
u'type': u'AND'},
{u'children': [{u'querypos': 2,
u'type': u'KEYWORD',
u'word': u'hey'}],
u'description': u'AND(fields=(content), KEYWORD(hey, querypos=2))',
u'fields': [u'content'],
u'type': u'AND'}],
u'description': u'AND( AND(fields=(title), KEYWORD(way*, querypos=1, expanded)), AND(fields=(content), KEYWORD(hey, querypos=2)))',
u'type': u'AND'}},
'timed_out': False,
'took': 0}
res = await searchApi.search({"index":"forum","query":{"query_string":"@title way* @content hey"},"_source":{"excludes":["*"]},"limit":1,"profile":true});
{"hits": {"hits": [{"_id": "2811025403043381551",
"_score": 2643,
"_source": {}}],
"total": 1},
"profile": {"query": {"children": [{"children": [{"expanded": True,
"querypos": 1,
"type": "KEYWORD",
"word": "way*"}],
"description": "AND(fields=(title), KEYWORD(way*, querypos=1, expanded))",
"fields": ["title"],
"type": "AND"},
{"children": [{"querypos": 2,
"type": "KEYWORD",
"word": "hey"}],
"description": "AND(fields=(content), KEYWORD(hey, querypos=2))",
"fields": ["content"],
"type": "AND"}],
"description": "AND( AND(fields=(title), KEYWORD(way*, querypos=1, expanded)), AND(fields=(content), KEYWORD(hey, querypos=2)))",
"type": "AND"}},
"timed_out": False,
"took": 0}
query = new HashMap<String,Object>();
query.put("query_string","@title way* @content hey");
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
searchRequest.setQuery(query);
searchRequest.setProfile(true);
searchRequest.setLimit(1);
searchRequest.setSort(new ArrayList<String>(){{
add("*");
}});
searchResponse = searchApi.search(searchRequest);
class SearchResponse {
took: 18
timedOut: false
hits: class SearchResponseHits {
total: 1
hits: [{_id=2811025403043381551, _score=2643, _source={}}]
aggregations: null
}
profile: {query={type=AND, description=AND( AND(fields=(title), KEYWORD(way*, querypos=1, expanded)), AND(fields=(content), KEYWORD(hey, querypos=2))), children=[{type=AND, description=AND(fields=(title), KEYWORD(way*, querypos=1, expanded)), fields=[title], children=[{type=KEYWORD, word=way*, querypos=1, expanded=true}]}, {type=AND, description=AND(fields=(content), KEYWORD(hey, querypos=2)), fields=[content], children=[{type=KEYWORD, word=hey, querypos=2}]}]}}
}
object query = new { query_string="@title way* @content hey" };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Profile = true;
searchRequest.Limit = 1;
searchRequest.Sort = new List<Object> { "*" };
var searchResponse = searchApi.Search(searchRequest);
class SearchResponse {
took: 18
timedOut: false
hits: class SearchResponseHits {
total: 1
hits: [{_id=2811025403043381551, _score=2643, _source={}}]
aggregations: null
}
profile: {query={type=AND, description=AND( AND(fields=(title), KEYWORD(way*, querypos=1, expanded)), AND(fields=(content), KEYWORD(hey, querypos=2))), children=[{type=AND, description=AND(fields=(title), KEYWORD(way*, querypos=1, expanded)), fields=[title], children=[{type=KEYWORD, word=way*, querypos=1, expanded=true}]}, {type=AND, description=AND(fields=(content), KEYWORD(hey, querypos=2)), fields=[content], children=[{type=KEYWORD, word=hey, querypos=2}]}]}}
}
res = await searchApi.search({
index: 'test',
query: { query_string: '@content 1'},
_source: { excludes: ["*"] },
limit:1,
profile":true
});
{
"hits":
{
"hits":
[{
"_id": "1",
"_score": 1480,
"_source": {}
}],
"total": 1
},
"profile":
{
"query":
{
"children":
[{
"children":
[{
"expanded": True,
"querypos": 1,
"type": "KEYWORD",
"word": "1*"
}],
"description": "AND(fields=(content), KEYWORD(1*, querypos=1, expanded))",
"fields": ["content"],
"type": "AND"
}],
"description": "AND(fields=(content), KEYWORD(1*, querypos=1))",
"type": "AND"
}},
"timed_out": False,
"took": 0
}
searchRequest := manticoresearch.NewSearchRequest("test")
query := map[string]interface{} {"query_string": "1*"}
source := map[string]interface{} { "excludes": []string {"*"} }
searchRequest.SetQuery(query)
searchRequest.SetSource(source)
searchRequest.SetLimit(1)
searchRequest.SetProfile(true)
res, _, _ := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()
{
"hits":
{
"hits":
[{
"_id": "1",
"_score": 1480,
"_source": {}
}],
"total": 1
},
"profile":
{
"query":
{
"children":
[{
"children":
[{
"expanded": True,
"querypos": 1,
"type": "KEYWORD",
"word": "1*"
}],
"description": "AND(fields=(content), KEYWORD(1*, querypos=1, expanded))",
"fields": ["content"],
"type": "AND"
}],
"description": "AND(fields=(content), KEYWORD(1*, querypos=1))",
"type": "AND"
}},
"timed_out": False,
"took": 0
}
The SQL statement EXPLAIN QUERY enables the display of the execution tree for a given full-text query without performing an actual search query on the table.
EXPLAIN QUERY index_base '@title running @body dog'\G
EXPLAIN QUERY index_base '@title running @body dog'\G
*************************** 1. row ***************************
Variable: transformed_tree
Value: AND(
OR(
AND(fields=(title), KEYWORD(run, querypos=1, morphed)),
AND(fields=(title), KEYWORD(running, querypos=1, morphed))),
AND(fields=(body), KEYWORD(dog, querypos=2, morphed)))
EXPLAIN QUERY ... option format=dot allows displaying the execution tree of a provided full-text query in a hierarchical format suitable for visualization by existing tools, such as https://dreampuf.github.io/GraphvizOnline:

EXPLAIN QUERY tbl 'i me' option format=dot\G
EXPLAIN QUERY tbl 'i me' option format=dot\G
*************************** 1. row ***************************
Variable: transformed_tree
Value: digraph "transformed_tree"
{
0 [shape=record,style=filled,bgcolor="lightgrey" label="AND"]
0 -> 1
1 [shape=record,style=filled,bgcolor="lightgrey" label="AND"]
1 -> 2
2 [shape=record label="i | { querypos=1 }"]
0 -> 3
3 [shape=record,style=filled,bgcolor="lightgrey" label="AND"]
3 -> 4
4 [shape=record label="me | { querypos=2 }"]
}
When using an expression ranker, it's possible to reveal the values of the calculated factors with the PACKEDFACTORS() function.
The function returns the values of all calculated ranking factors: document-level factors, field-level factors for each matched field, and word-level factors for each matched keyword.
These values can be utilized to understand why certain documents receive lower or higher scores in a search or to refine the existing ranking expression.
Example:
SELECT id, PACKEDFACTORS() FROM test1 WHERE MATCH('test one') OPTION ranker=expr('1')\G
id: 1
packedfactors(): bm25=569, bm25a=0.617197, field_mask=2, doc_word_count=2,
field1=(lcs=1, hit_count=2, word_count=2, tf_idf=0.152356,
min_idf=-0.062982, max_idf=0.215338, sum_idf=0.152356, min_hit_pos=4,
min_best_span_pos=4, exact_hit=0, max_window_hits=1, min_gaps=2,
exact_order=1, lccs=1, wlccs=0.215338, atc=-0.003974),
word0=(tf=1, idf=-0.062982),
word1=(tf=1, idf=0.215338)
1 row in set (0.00 sec)
Queries can be automatically optimized if OPTION boolean_simplify=1 is specified. Some transformations performed by this optimization include:
- ((A | B) | C) becomes (A | B | C); ((A B) C) becomes (A B C)
- ((A !N1) !N2) becomes (A !(N1 | N2))
- ((A !N) | (B !N)) becomes ((A | B) !N)
- ((A !(N AA)) | (B !(N BB))) becomes (((A | B) !N) | (A !AA) | (B !BB)) if the cost of evaluating N is greater than the sum of evaluating A and B
- ((A (N | AA)) | (B (N | BB))) becomes (((A | B) N) | (A AA) | (B BB)) if the cost of evaluating N is greater than the sum of evaluating A and B
- (A | "A B"~N) becomes A; ("A B" | "A B C") becomes "A B"; ("A B"~N | "A B C"~N) becomes ("A B"~N)
- ("X A B" | "Y A B") becomes ("("X"|"Y") A B")
- ((A !X) | (A !Y) | (A !Z)) becomes (A !(X Y Z))
- ((A !(N | N1)) | (B !(N | N2))) becomes (( (A !N1) | (B !N2) ) !N)

For simple or already hand-optimized queries, you may do better with the default boolean_simplify=0 value. Simplifications often benefit complex queries or algorithmically generated queries (see the sketch below).

Queries like -dog, which could potentially include all documents from the collection, are not allowed by default. To allow them, you must specify not_terms_only_allowed=1 either as a global setting or as a search option.
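A minimal sketch of triggering the common-NOT rewrite listed above, assuming a hypothetical forum table and placeholder keywords:

SELECT id FROM forum WHERE MATCH('(cat !dog) | (rat !dog)') OPTION boolean_simplify=1;

With simplification enabled, the engine may internally evaluate this as ((cat | rat) !dog), so the dog keyword list is processed only once.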
When you run a query via SQL over the MySQL protocol, you receive the requested columns as a result or an empty result set if nothing is found.
SELECT * FROM tbl;
+------+------+--------+
| id | age | name |
+------+------+--------+
| 1 | 25 | joe |
| 2 | 25 | mary |
| 3 | 33 | albert |
+------+------+--------+
3 rows in set (0.00 sec)
Additionally, you can use the SHOW META call to see extra meta-information about the latest query.
SELECT id,story_author,comment_author FROM hn_small WHERE story_author='joe' LIMIT 3; SHOW META;
+--------+--------------+----------------+
| id | story_author | comment_author |
+--------+--------------+----------------+
| 152841 | joe | SwellJoe |
| 161323 | joe | samb |
| 163735 | joe | jsjenkins168 |
+--------+--------------+----------------+
3 rows in set (0.01 sec)
+----------------+-------+
| Variable_name | Value |
+----------------+-------+
| total | 3 |
| total_found | 20 |
| total_relation | gte |
| time | 0.010 |
+----------------+-------+
4 rows in set (0.00 sec)
In some cases, such as when performing a faceted search, you may receive multiple result sets as a response to your SQL query.
SELECT * FROM tbl WHERE MATCH('joe') FACET age;
+------+------+
| id | age |
+------+------+
| 1 | 25 |
+------+------+
1 row in set (0.00 sec)
+------+----------+
| age | count(*) |
+------+----------+
| 25 | 1 |
+------+----------+
1 row in set (0.00 sec)
In case of a warning, the result set will include a warning flag, and you can see the warning using SHOW WARNINGS.
SELECT * from tbl where match('"joe"/3'); show warnings;
+------+------+------+
| id | age | name |
+------+------+------+
| 1 | 25 | joe |
+------+------+------+
1 row in set, 1 warning (0.00 sec)
+---------+------+--------------------------------------------------------------------------------------------+
| Level | Code | Message |
+---------+------+--------------------------------------------------------------------------------------------+
| warning | 1000 | quorum threshold too high (words=1, thresh=3); replacing quorum operator with AND operator |
+---------+------+--------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
If your query fails, you will receive an error:
SELECT * from tbl where match('@surname joe');
ERROR 1064 (42000): index idx: query error: no field 'surname' found in schema
Via the HTTP JSON interface, the query result is sent as a JSON document. Example:
{
"took":10,
"timed_out": false,
"hits":
{
"total": 2,
"hits":
[
{
"_id": "1",
"_score": 1,
"_source": { "gid": 11 }
},
{
"_id": "2",
"_score": 1,
"_source": { "gid": 12 }
}
]
}
}
- took: time in milliseconds it took to execute the search
- timed_out: whether the query timed out or not
- hits: search results, with the following properties:
  - total: total number of matching documents
  - hits: an array containing matches

The query result can also include query profile information. See Query profile.
Each match in the hits array has the following properties:
- _id: match id
- _score: match weight, calculated by the ranker
- _source: an array containing the attributes of this match

By default, all attributes are returned in the _source array. You can use the _source property in the request payload to select the fields you want to include in the result set. Example:
{
"index":"test",
"_source":"attr*",
"query": { "match_all": {} }
}
You can specify the attributes you want to include in the query result as a string ("_source": "attr*") or as an array of strings ("_source": [ "attr1", "attri*" ]). Each entry can be an attribute name or a wildcard (*, % and ? symbols are supported).
You can also explicitly specify which attributes you want to include and which to exclude from the result set using the includes and excludes properties:
"_source":
{
"includes": [ "attr1", "attri*" ],
"excludes": [ "*desc*" ]
}
An empty list of includes is interpreted as "include all attributes," while an empty list of excludes does not match anything. If an attribute matches both the includes and excludes, then the excludes win.
WHERE is an SQL clause that works for both full-text matching and additional filtering. The following operators are available:
<, >, <=, >=, =, <>, BETWEEN, IN, IS NULLAND, OR, NOTMATCH('query') is supported and maps to a full-text query.
The {col_name | expr_alias} [NOT] IN @uservar condition syntax is supported. Refer to the SET syntax for a description of global user variables.
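A minimal sketch of the @uservar form, assuming a hypothetical tbl table and a global user variable defined beforehand with SET:

SET GLOBAL @active_ids = (1, 3, 5);
SELECT id FROM tbl WHERE id IN @active_ids;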
If you prefer the HTTP JSON interface, you can also apply filtering. It might seem more complex than SQL, but it is recommended for cases when you need to prepare a query programmatically, such as when a user fills out a form in your application.
Here's an example of several filters in a bool query.
This full-text query matches all documents containing product in any field. These documents must have a price greater than or equal to 500 (gte) and less than or equal to 1000 (lte). All of these documents must not have a revision less than 15 (lt).
POST /search
{
"index": "test1",
"query": {
"bool": {
"must": [
{ "match" : { "_all" : "product" } },
{ "range": { "price": { "gte": 500, "lte": 1000 } } }
],
"must_not": {
"range": { "revision": { "lt": 15 } }
}
}
}
}
The bool query matches documents based on boolean combinations of other queries and/or filters. Queries and filters must be specified in must, should, or must_not sections and can be nested.
POST /search
{
"index":"test1",
"query": {
"bool": {
"must": [
{ "match": {"_all":"keyword"} },
{ "range": { "revision": { "gte": 14 } } }
]
}
}
}
Queries and filters specified in the must section are required to match the documents. If multiple fulltext queries or filters are specified, all of them must match. This is the equivalent of AND queries in SQL. Note that if you want to match against an array (multi-value attribute), you can specify the attribute multiple times. The result will be positive only if all the queried values are found in the array, e.g.:
"must": [
{"equals" : { "product_codes": 5 }},
{"equals" : { "product_codes": 6 }}
]
Note also, it may be better in terms of performance to use:
{"in" : { "all(product_codes)": [5,6] }}
(see details below).
Queries and filters specified in the should section should match the documents. If some queries are specified in must or must_not, should queries are ignored. On the other hand, if there are no queries other than should, then at least one of these queries must match a document for it to match the bool query. This is the equivalent of OR queries. Note, if you want to match against an array (multi-value attribute) you can specify the attribute multiple times, e.g.:
"should": [
{"equals" : { "product_codes": 7 }},
{"equals" : { "product_codes": 8 }}
]
Note also, it may be better in terms of performance to use:
{"in" : { "any(product_codes)": [7,8] }}
(see details below).
Queries and filters specified in the must_not section must not match the documents. If several queries are specified under must_not, the document matches if none of them match.
POST /search
{
"index":"t",
"query": {
"bool": {
"should": [
{
"equals": {
"b": 1
}
},
{
"equals": {
"b": 3
}
}
],
"must": [
{
"equals": {
"a": 1
}
}
],
"must_not": {
"equals": {
"b": 2
}
}
}
}
}
A bool query can be nested inside another bool so you can make more complex queries. To make a nested boolean query, just place another bool inside a must, should, or must_not section. Here is how this query:
a = 2 and (a = 10 or b = 0)
should be presented in JSON.
POST /search
{
"index":"t",
"query": {
"bool": {
"must": [
{
"equals": {
"a": 2
}
},
{
"bool": {
"should": [
{
"equals": {
"a": 10
}
},
{
"equals": {
"b": 0
}
}
]
}
}
]
}
}
}
More complex query:
(a = 1 and b = 1) or (a = 10 and b = 2) or (b = 0)
POST /search
{
"index":"t",
"query": {
"bool": {
"should": [
{
"bool": {
"must": [
{
"equals": {
"a": 1
}
},
{
"equals": {
"b": 1
}
}
]
}
},
{
"bool": {
"must": [
{
"equals": {
"a": 10
}
},
{
"equals": {
"b": 2
}
}
]
}
},
{
"bool": {
"must": [
{
"equals": {
"b": 0
}
}
]
}
}
]
}
}
}
Queries in SQL format (query_string) can also be used in bool queries.
POST /search
{
"index": "test1",
"query": {
"bool": {
"must": [
{ "query_string" : "product" },
{ "query_string" : "good" }
]
}
}
}
Equality filters are the simplest filters that work with integer, float and string attributes.
POST /search
{
"index":"test1",
"query": {
"equals": { "price": 500 }
}
}
Filter equals can be applied to a multi-value attribute and you can use:
- any() which will be positive if the attribute has at least one value which equals to the queried value;
- all() which will be positive if the attribute has a single value and it equals to the queried value

POST /search
{
"index":"test1",
"query": {
"equals": { "any(price)": 100 }
}
}
Set filters check if attribute value is equal to any of the values in the specified set.
Set filters support integer, string and multi-value attributes.
POST /search
{
"index":"test1",
"query": {
"in": {
"price": [1,10,100]
}
}
}
When applied to a multi-value attribute you can use:
- any() (equivalent to no function) which will be positive if there's at least one match between the attribute values and the queried values;
- all() which will be positive if all the attribute values are in the queried set

POST /search
{
"index":"test1",
"query": {
"in": {
"all(price)": [1,10]
}
}
}
Range filters match documents that have attribute values within a specified range.
Range filters support the following properties:
- gte: greater than or equal to
- gt: greater than
- lte: less than or equal to
- lt: less than

POST /search
{
"index":"test1",
"query": {
"range": {
"price": {
"gte": 500,
"lte": 1000
}
}
}
}
geo_distance filters are used to filter the documents that are within a specific distance from a geo location.
Specifies the pin location, in degrees. Distances are calculated from this point.
Specifies the attributes that contain latitude and longitude.
Specifies distance calculation function. Can be either adaptive or haversine. adaptive is faster and more precise, for more details see GEODIST(). Optional, defaults to adaptive.
Specifies the maximum distance from the pin locations. All documents within this distance match. The distance can be specified in various units. If no unit is specified, the distance is assumed to be in meters. Here is a list of supported distance units:
- m or meters
- km or kilometers
- cm or centimeters
- mm or millimeters
- mi or miles
- yd or yards
- ft or feet
- in or inch
- NM, nmi or nauticalmiles

location_anchor and location_source properties accept the following latitude/longitude formats:
{ "lat": "attr_lat", "lon": "attr_lon" }"attr_lat, attr_lon"[attr_lon, attr_lat]Latitude and longitude are specified in degrees.
POST /search
{
"index":"test",
"query": {
"geo_distance": {
"location_anchor": {"lat":49, "lon":15},
"location_source": {"attr_lat, attr_lon"},
"distance_type": "adaptive",
"distance":"100 km"
}
}
}
POST /search
{
"index": "geodemo",
"query": {
"bool": {
"must": [
{
"match": {
"*": "station"
}
},
{
"equals": {
"state_code": "ENG"
}
},
{
"geo_distance": {
"distance_type": "adaptive",
"location_anchor": {
"lat": 52.396,
"lon": -1.774
},
"location_source": "latitude_deg,longitude_deg",
"distance": "10000 m"
}
}
]
}
}
}
Manticore enables the use of arbitrary arithmetic expressions through both SQL and HTTP, incorporating attribute values, internal attributes (document ID and relevance weight), arithmetic operations, several built-in functions, and user-defined functions. Below is the complete reference list for quick access.
+, -, *, /, %, DIV, MOD
Standard arithmetic operators are available. Arithmetic calculations involving these operators can be executed in three different modes: (1) using single-precision, 32-bit IEEE 754 floating-point values (the default), (2) using signed 32-bit integers, (3) using 64-bit signed integers.
The expression parser automatically switches to integer mode if no operations result in a floating point value. Otherwise, it uses the default floating point mode. For example, a+b will be computed using 32-bit integers if both arguments are 32-bit integers; or using 64-bit integers if both arguments are integers but one of them is 64-bit; or in floats otherwise. However, a/b or sqrt(a) will always be computed in floats, as these operations return a non-integer result. To avoid this, you can use IDIV(a,b) or a DIV b form. Additionally, a*b will not automatically promote to 64-bit when arguments are 32-bit. To enforce 64-bit results, use BIGINT(), but note that if non-integer operations are present, BIGINT() will simply be ignored.
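A short sketch of the mode switching described above, assuming a hypothetical tbl table with integer columns a and b:

SELECT a / b AS float_div, a DIV b AS int_div, IDIV(a, b) AS int_div2, BIGINT(a) * b AS wide_mul FROM tbl;

Here a / b is always computed in floats, a DIV b and IDIV(a, b) keep the division in integer mode, and BIGINT(a) forces the multiplication to be carried out in 64-bit integers.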
<, > <=, >=, =, <>
The comparison operators return 1.0 when the condition is true and 0.0 otherwise. For example, (a=b)+3 evaluates to 4 when attribute a is equal to attribute b, and to 3 when a is not. Unlike MySQL, the equality comparisons (i.e., = and <> operators) include a small equality threshold (1e-6 by default). If the difference between the compared values is within the threshold, they are considered equal.
The BETWEEN and IN operators, in the case of multi-value attributes, return true if at least one value matches the condition (similar to ANY()). The IN operator does not support JSON attributes. The IS (NOT) NULL operator is supported only for JSON attributes.
AND, OR, NOT
Boolean operators (AND, OR, NOT) behave as expected. They are left-associative and have the lowest priority compared to other operators. NOT has higher priority than AND and OR but still less than any other operator. AND and OR share the same priority, so using parentheses is recommended to avoid confusion in complex expressions.
&, |
These operators perform bitwise AND and OR respectively. The operands must be of integer types.
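A combined sketch of the comparison, boolean, and bitwise operators in expressions, assuming a hypothetical products table with integer attributes price, cost, stock, and flags:

SELECT id, (price = cost) + 3 AS eq_plus_three, (price > 100 AND stock > 0) AS in_stock_expensive, (flags & 4) | 1 AS masked_flags FROM products;

eq_plus_three evaluates to 4 when price equals cost (within the equality threshold) and to 3 otherwise; the boolean expression yields 1.0 or 0.0; the bitwise operators expect integer operands.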
In the HTTP JSON interface, expressions are supported via script_fields and expressions.
{
"index": "test",
"query": {
"match_all": {}
}, "script_fields": {
"add_all": {
"script": {
"inline": "( gid * 10 ) | crc32(title)"
}
},
"title_len": {
"script": {
"inline": "crc32(title)"
}
}
}
}
In this example, two expressions are created: add_all and title_len. The first expression calculates ( gid * 10 ) | crc32(title) and stores the result in the add_all attribute. The second expression calculates crc32(title) and stores the result in the title_len attribute.
Currently, only inline expressions are supported. The value of the inline property (the expression to compute) has the same syntax as SQL expressions.
The expression name can be utilized in filtering or sorting.
{
"index":"movies_rt",
"script_fields":{
"cond1":{
"script":{
"inline":"actor_2_facebook_likes =296 OR movie_facebook_likes =37000"
}
},
"cond2":{
"script":{
"inline":"IF (IN (content_rating,'TV-PG','PG'),2, IF(IN(content_rating,'TV-14','PG-13'),1,0))"
}
}
},
"limit":10,
"sort":[
{
"cond2":"desc"
},
{
"actor_1_name":"asc"
},
{
"actor_2_name":"desc"
}
],
"profile":true,
"query":{
"bool":{
"must":[
{
"match":{
"*":"star"
}
},
{
"equals":{
"cond1":1
}
}
],
"must_not":[
{
"equals":{
"content_rating":"R"
}
}
]
}
}
}
By default, expression values are included in the _source array of the result set. If the source is selective (see Source selection), the expression name can be added to the _source parameter in the request. Note, the names of the expressions must be in lowercase.
expressions is an alternative to script_fields with a simpler syntax. The example request adds two expressions and stores the results into add_all and title_len attributes. Note, the names of the expressions must be in lowercase.
{
"index": "test",
"query": { "match_all": {} },
"expressions":
{
"add_all": "( gid * 10 ) | crc32(title)",
"title_len": "crc32(title)"
}
}
The SQL SELECT clause and the HTTP /search endpoint support a number of options that can be used to fine-tune search behavior.
SQL:
SELECT ... [OPTION <optionname>=<value> [ , ... ]] [/*+ [NO_][ColumnarScan|DocidIndex|SecondaryIndex](<attribute>[,...]) */]
HTTP:
POST /search
{
"index" : "index_name",
"options":
{
"optionname": "value",
"optionname2": <value2>
}
}
SQL:
SELECT * FROM test WHERE MATCH('@title hello @body world')
OPTION ranker=bm25, max_matches=3000,
field_weights=(title=10, body=3), agent_query_timeout=10000
+------+-------+-------+
| id | title | body |
+------+-------+-------+
| 1 | hello | world |
+------+-------+-------+
1 row in set (0.00 sec)
POST /search
{
"index" : "test",
"query": {
"match": {
"title": "hello"
},
"match": {
"body": "world"
}
},
"options":
{
"ranker": "bm25",
"max_matches": 3000,
"field_weights": {
"title": 10,
"body": 3
},
"agent_query_timeout": 10000
}
}
{
"took": 0,
"timed_out": false,
"hits": {
"total": 1,
"total_relation": "eq",
"hits": [
{
"_id": "1",
"_score": 10500,
"_source": {
"title": "hello",
"body": "world"
}
}
]
}
}
Supported options are:
Integer. Enables or disables guaranteed aggregate accuracy when running groupby queries in multiple threads. Default is 0.
When running a groupby query, it can be run in parallel on a plain table with several pseudo shards (if pseudo_sharding is on). A similar approach works on RT tables. Each shard/chunk executes the query, but the number of groups is limited by max_matches. If the result sets from different shards/chunks have different groups, the group counts and aggregates may be inaccurate. Note that Manticore tries to increase max_matches up to max_matches_increase_threshold based on the number of unique values of the groupby attribute (retrieved from secondary indexes). If it succeeds, there will be no loss in accuracy.
However, if the number of unique values of the groupby attribute is high, further increasing max_matches may not be a good strategy because it can lead to a loss in performance and higher memory usage. Setting accurate_aggregation to 1 forces groupby searches to run in a single thread, which fixes the accuracy issue. Note that running in a single thread is only enforced when max_matches cannot be set high enough; otherwise, searches with accurate_aggregation=1 will still run in multiple threads.
Overall, setting accurate_aggregation to 1 ensures group count and aggregate accuracy in RT tables and plain tables with pseudo_sharding=1. The drawback is that searches will run slower since they will be forced to operate in a single thread.
However, if we have an RT table and a plain table containing the same data, and we run a query with accurate_aggregation=1, we might still receive different results. This occurs because the daemon might choose different max_matches settings for the RT and plain table due to the max_matches_increase_threshold setting.
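A minimal sketch of enforcing the accurate mode for a grouped query, assuming a hypothetical products table:

SELECT category_id, COUNT(*) FROM products GROUP BY category_id OPTION accurate_aggregation=1;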
Integer. Max time in milliseconds to wait for remote queries to complete, see this section.
0 or 1 (0 by default). boolean_simplify=1 enables simplifying the query to speed it up.
String, user comment that gets copied to a query log file.
Integer. Max found matches threshold. The value is selected automatically if not specified.
- N = 0 disables the threshold
- N > 0: instructs Manticore to stop looking for results as soon as it finds N documents

In case Manticore cannot calculate the exact matching documents count, you will see total_relation: gte in the query meta information, which means that the actual count is Greater Than or Equal to the total (total_found in SHOW META via SQL, hits.total in JSON via HTTP). If the total value is precise, you'll get total_relation: eq.
Integer. Default is 3500. This option sets the threshold below which counts returned by count distinct are guaranteed to be exact within a plain table.
Accepted values range from 500 to 15500. Values outside this range will be clamped.
When this option is set to 0, it enables an algorithm that ensures exact counts. This algorithm collects {group, value} pairs, sorts them, and periodically eliminates duplicates. The result is precise counts within a plain table. However, this approach is not suitable for high-cardinality datasets due to its high memory consumption and slow query execution.
When distinct_precision_threshold is set to a value greater than 0, Manticore employs a different algorithm. It loads counts into a hash table and returns the size of the table. If the hash table becomes too large, its contents are moved into a HyperLogLog data structure. At this point, the counts become approximate because HyperLogLog is a probabilistic algorithm. This approach maintains a fixed maximum memory usage per group, but there is a tradeoff in count accuracy.
The accuracy of the HyperLogLog and the threshold for converting from the hash table to HyperLogLog are derived from the distinct_precision_threshold setting. It's important to use this option with caution since doubling its value will also double the maximum memory required to calculate counts. The maximum memory usage can be roughly estimated using this formula: 64 * max_matches * distinct_precision_threshold, although in practice, count calculations often use less memory than the worst-case scenario.
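A minimal sketch of tuning this threshold for a count-distinct query, assuming a hypothetical visits table:

SELECT category_id, COUNT(DISTINCT user_id) FROM visits GROUP BY category_id OPTION distinct_precision_threshold=500;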
0 or 1 (0 by default). Expands keywords with exact forms and/or stars when possible. Refer to expand_keywords for more details.
Named integer list (per-field user weights for ranking).
Example:
SELECT ... OPTION field_weights=(title=10, body=3)
Use global statistics (frequencies) from the global_idf file for IDF computations.
Quoted, comma-separated list of IDF computation flags. Known flags are:
- normalized: BM25 variant, idf = log((N-n+1)/n), as per Robertson et al
- plain: plain variant, idf = log(N/n), as per Sparck-Jones
- tfidf_normalized: additionally divide IDF by query word count, so that TF*IDF fits into [0, 1] range
- tfidf_unnormalized: do not additionally divide IDF by query word count

Here N is the collection size and n is the number of matched documents.

The historically default IDF (Inverse Document Frequency) in Manticore is equivalent to OPTION idf='normalized,tfidf_normalized', and those normalizations may cause several undesired effects.
First, idf=normalized causes keyword penalization. For instance, if you search for the | something and the occurs in more than 50% of the documents, then documents with both keywords the and something will get less weight than documents with just one keyword something. Using OPTION idf=plain avoids this. Plain IDF varies in [0, log(N)] range, and keywords are never penalized; while the normalized IDF varies in [-log(N), log(N)] range, and too frequent keywords are penalized.
Second, idf=tfidf_normalized leads to IDF drift across queries. Historically, IDF was also divided by the query keyword count, ensuring the entire sum(tf*idf) across all keywords remained within the [0,1] range. However, this meant that queries like word1 and word1 | nonmatchingword2 would assign different weights to the exact same result set, as the IDFs for both word1 and nonmatchingword2 would be divided by 2. Using OPTION idf='tfidf_unnormalized' resolves this issue. Keep in mind that BM25, BM25A, BM25F() ranking factors will be adjusted accordingly when you disable this normalization.
IDF flags can be combined; plain and normalized are mutually exclusive; tfidf_unnormalized and tfidf_normalized are also mutually exclusive; and unspecified flags in such mutually exclusive groups default to their original settings. This means OPTION idf=plain is the same as specifying OPTION idf='plain,tfidf_normalized' in its entirety.
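For example, both normalizations can be disabled at query time. A minimal sketch, assuming a hypothetical forum table:

SELECT id, WEIGHT() FROM forum WHERE MATCH('the | something') OPTION idf='plain,tfidf_unnormalized';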
Named integer list. Per-table user weights for ranking.
0 or 1, automatically sum DFs over all local parts of a distributed table, ensuring consistent (and accurate) IDF across a locally sharded table. Enabled by default for disk chunks of the RT table.
0 or 1 (0 by default). Setting low_priority=1 executes the query with a lower priority, rescheduling its jobs 10 times less frequently than other queries with normal priority.
Integer. Per-query max matches value.
The maximum number of matches that the server retains in RAM for each table and can return to the client. The default is 1000.
Introduced to control and limit RAM usage, the max_matches setting determines how many matches will be kept in RAM while searching each table. Every match found is still processed, but only the best N of them will be retained in memory and returned to the client in the end. For example, suppose a table contains 2,000,000 matches for a query. It's rare that you would need to retrieve all of them. Instead, you need to scan all of them but only choose the "best" 500, for instance, based on some criteria (e.g., sorted by relevance, price, or other factors) and display those 500 matches to the end user in pages of 20 to 100 matches. Tracking only the best 500 matches is much more RAM and CPU efficient than keeping all 2,000,000 matches, sorting them, and then discarding everything but the first 20 needed for the search results page. max_matches controls the N in that "best N" amount.
This parameter significantly impacts per-query RAM and CPU usage. Values of 1,000 to 10,000 are generally acceptable, but higher limits should be used with caution. Carelessly increasing max_matches to 1,000,000 means that searchd will have to allocate and initialize a 1-million-entry matches buffer for every query. This will inevitably increase per-query RAM usage and, in some cases, can noticeably affect performance.
Refer to max_matches_increase_threshold for additional information on how it can influence the behavior of the max_matches option.
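A minimal sketch of raising the limit to page deep into a result set, assuming a hypothetical products table:

SELECT id FROM products WHERE MATCH('phone') ORDER BY price ASC LIMIT 9000, 20 OPTION max_matches=10000;

Paging past the default 1000 best matches requires raising max_matches accordingly, at the cost of a larger per-query match buffer.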
Integer. Sets the threshold that max_matches can be increased to. Default is 16384.
Manticore may increase max_matches to enhance groupby and/or aggregation accuracy when pseudo_sharding is enabled, and if it detects that the number of unique values of the groupby attribute is less than this threshold. Loss of accuracy may occur when pseudo-sharding executes the query in multiple threads or when an RT table conducts parallel searches in disk chunks.
If the number of unique values of the groupby attribute is less than the threshold, max_matches will be set to this number. Otherwise, the default max_matches will be used.
If max_matches was explicitly set in query options, this threshold has no effect.
Keep in mind that if this threshold is set too high, it will result in increased memory consumption and general performance degradation.
You can also enforce a guaranteed groupby/aggregate accuracy mode using the accurate_aggregation option.
Sets the maximum search query time in milliseconds. Must be a non-negative integer. The default value is 0, which means "do not limit." Local search queries will be stopped once the specified time has elapsed. Note that if you're performing a search that queries multiple local tables, this limit applies to each table separately. Be aware that this may slightly increase the query's response time due to the overhead caused by constantly tracking whether it's time to stop the query.
Integer. Maximum predicted search time; see predicted_time_costs.
none allows replacing all query terms with their exact forms if the table was built with index_exact_words enabled. This is useful for preventing stemming or lemmatizing query terms.
0 or 1 allows standalone negation for the query. The default is 0. See also the corresponding global setting.
MySQL [(none)]> select * from tbl where match('-donald');
ERROR 1064 (42000): index t: query error: query is non-computable (single NOT operator)
MySQL [(none)]> select * from t where match('-donald') option not_terms_only_allowed=1;
+---------------------+-----------+
| id | field |
+---------------------+-----------+
| 1658178727135150081 | smth else |
+---------------------+-----------+
Choose from the following options:
- proximity_bm25
- bm25
- none
- wordcount
- proximity
- matchany
- fieldmask
- sph04
- expr
- export

For more details on each ranker, refer to Search results ranking.
Allows you to specify a specific integer seed value for an ORDER BY RAND() query, for example: ... OPTION rand_seed=1234. By default, a new and different seed value is autogenerated for every query.
Integer. Distributed retries count.
Integer. Distributed retry delay, in milliseconds.
- pq - priority queue, set by default
- kbuffer - provides faster sorting for already pre-sorted data, e.g., table data sorted by id

Limits the max number of threads used for current query processing. Default - no limit (the query can occupy all threads as defined globally).
For a batch of queries, the option must be attached to the very first query in the batch, and it is then applied when the working queue is created and is effective for the entire batch. This option has the same meaning as the option max_threads_per_query, but is applied only to the current query or batch of queries.
Quoted, colon-separated string of library name:plugin name:optional string of settings. A query-time token filter is created for each search when full-text is invoked by every table involved, allowing you to implement a custom tokenizer that generates tokens according to custom rules.
SELECT * FROM index WHERE MATCH ('yes@no') OPTION token_filter='mylib.so:blend:@'
Restricts the maximum number of expanded keywords for a single wildcard, with a default value of 0 indicating no limit. For additional details, refer to expansion_limit.
In rare cases, Manticore's built-in query analyzer may be incorrect in understanding a query and determining whether a docid index, secondary indexes, or columnar scan should be used. To override the query optimizer's decisions, you can use the following hints in your query:
- /*+ DocidIndex(id) */ to force the use of a docid index, /*+ NO_DocidIndex(id) */ to tell the optimizer to ignore it
- /*+ SecondaryIndex(<attr_name1>[, <attr_nameN>]) */ to force the use of a secondary index (if available), /*+ NO_SecondaryIndex(id) */ to tell the optimizer to ignore it
- /*+ ColumnarScan(<attr_name1>[, <attr_nameN>]) */ to force the use of a columnar scan (if the attribute is columnar), /*+ NO_ColumnarScan(id) */ to tell the optimizer to ignore it

Note that when executing a full-text query with filters, the query optimizer decides between intersecting the results of the full-text tree with the filter results or using a standard match-then-filter approach. Specifying any hint will force the daemon to use the code path that performs the intersection of the full-text tree results with the filter results.
For more information on how the query optimizer works, refer to the Cost based optimizer page.
SELECT * FROM students where age > 21 /*+ SecondaryIndex(age) */
When using a MySQL/MariaDB client, make sure to include the --comments flag to enable the hints in your queries.
mysql -P9306 -h0 --comments
Highlighting enables you to obtain highlighted text fragments (referred to as snippets) from documents containing matching keywords.
The SQL HIGHLIGHT() function, the "highlight" property in JSON queries via HTTP, and the highlight() function in the PHP client all utilize the built-in document storage to retrieve the original field contents (enabled by default).
SELECT HIGHLIGHT() FROM books WHERE MATCH('try');
+----------------------------------------------------------+
| highlight() |
+----------------------------------------------------------+
| Don`t <b>try</b> to compete in childishness, said Bliss. |
+----------------------------------------------------------+
1 row in set (0.00 sec)
POST /search
{
"index": "books",
"query": { "match": { "*" : "try" } },
"highlight": {}
}
{
"took":1,
"timed_out":false,
"hits":
{
"total":1,
"hits":
[
{
"_id":"4",
"_score":1704,
"_source":
{
"title":"Book four",
"content":"Don`t try to compete in childishness, said Bliss."
},
"highlight":
{
"title": ["Book four"],
"content": ["Don`t <b>try</b> to compete in childishness, said Bliss."]
}
}
]
}
}
$results = $index->search('try')->highlight()->get();
foreach($results as $doc)
{
echo 'Document: '.$doc->getId();
foreach($doc->getData() as $field=>$value)
{
echo $field.': '.$value;
}
foreach($doc->getHighlight() as $field=>$snippets)
{
echo "Highlight for ".$field.":\n";
foreach($snippets as $snippet)
{
echo "- ".$snippet."\n";
}
}
}
Document: 14
title: Book four
content: Don`t try to compete in childishness, said Bliss.
Highlight for title:
- Book four
Highlight for content:
- Don`t <b>try</b> to compete in childishness, said Bliss.
res = searchApi.search({"index":"books","query":{"match":{"*":"try"}},"highlight":{}})
{'aggregations': None,
'hits': {'hits': [{u'_id': u'4',
u'_score': 1695,
u'_source': {u'content': u'Don`t try to compete in childishness, said Bliss.',
u'title': u'Book four'},
u'highlight': {u'content': [u'Don`t <b>try</b> to compete in childishness, said Bliss.'],
u'title': [u'Book four']}}],
'max_score': None,
'total': 1},
'profile': None,
'timed_out': False,
'took': 0}
res = await searchApi.search({"index":"books","query":{"match":{"*":"try"}},"highlight":{}});
{"took":0,"timed_out":false,"hits":{"total":1,"hits":[{"_id":"4","_score":1695,"_source":{"title":"Book four","content":"Don`t try to compete in childishness, said Bliss."},"highlight":{"title":["Book four"],"content":["Don`t <b>try</b> to compete in childishness, said Bliss."]}}]}}
searchRequest = new SearchRequest();
searchRequest.setIndex("books");
query = new HashMap<String,Object>();
query.put("match",new HashMap<String,Object>(){{
put("*","try|gets|down|said");
}});
searchRequest.setQuery(query);
highlight = new HashMap<String,Object>(){{
}};
searchRequest.setHighlight(highlight);
searchResponse = searchApi.search(searchRequest);
class SearchResponse {
took: 0
timedOut: false
hits: class SearchResponseHits {
total: 3
maxScore: null
hits: [{_id=3, _score=1597, _source={title=Book three, content=Trevize whispered, "It gets infantile pleasure out of display. I`d love to knock it down."}, highlight={title=[Book three], content=[, "It <b>gets</b> infantile pleasure , to knock it <b>down</b>."]}}, {_id=4, _score=1563, _source={title=Book four, content=Don`t try to compete in childishness, said Bliss.}, highlight={title=[Book four], content=[Don`t <b>try</b> to compete in childishness, <b>said</b> Bliss.]}}, {_id=5, _score=1514, _source={title=Books two, content=A door opened before them, revealing a small room. Bander said, "Come, half-humans, I want to show you how we live."}, highlight={title=[Books two], content=[ a small room. Bander <b>said</b>, "Come, half-humans, I]}}]
aggregations: null
}
profile: null
}
var searchRequest = new SearchRequest("books");
searchRequest.FulltextFilter = new MatchFilter("*", "try|gets|down|said");
var highlight = new Highlight();
searchRequest.Highlight = highlight;
var searchResponse = searchApi.Search(searchRequest);
class SearchResponse {
took: 0
timedOut: false
hits: class SearchResponseHits {
total: 3
maxScore: null
hits: [{_id=3, _score=1597, _source={title=Book three, content=Trevize whispered, "It gets infantile pleasure out of display. I`d love to knock it down."}, highlight={title=[Book three], content=[, "It <b>gets</b> infantile pleasure , to knock it <b>down</b>."]}}, {_id=4, _score=1563, _source={title=Book four, content=Don`t try to compete in childishness, said Bliss.}, highlight={title=[Book four], content=[Don`t <b>try</b> to compete in childishness, <b>said</b> Bliss.]}}, {_id=5, _score=1514, _source={title=Books two, content=A door opened before them, revealing a small room. Bander said, "Come, half-humans, I want to show you how we live."}, highlight={title=[Books two], content=[ a small room. Bander <b>said</b>, "Come, half-humans, I]}}]
aggregations: null
}
profile: null
}
res = await searchApi.search({
index: 'test',
query: {
match: {
'*': 'Text 1'
}
},
highlight: {}
});
{
"took":0,
"timed_out":false,
"hits":
{
"total":1,
"hits":
[{
"_id":"1",
"_score":1480,
"_source":
{
"content":"Text 1"
},
"highlight":
{
"content":
[
"<b>Text 1</b>"
]
}
}]
}
}
searchRequest := manticoresearch.NewSearchRequest("test")
matchClause := map[string]interface{} {"*": "Text 1"}
query := map[string]interface{} {"match": matchClause}
searchRequest.SetQuery(query)
highlight := manticoresearch.NewHighlight()
searchRequest.SetHighlight(highlight)
res, _, _ := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()
{
"took":0,
"timed_out":false,
"hits":
{
"total":1,
"hits":
[{
"_id":"1",
"_score":1480,
"_source":
{
"content":"Text 1"
},
"highlight":
{
"content":
[
"<b>Text 1</b>"
]
}
}]
}
}
When using SQL for highlighting search results, you will receive snippets from various fields combined into a single string due to the limitations of the MySQL protocol. You can adjust the concatenation separators with the field_separator and snippet_separator options, as detailed below.
When executing JSON queries through HTTP or using the PHP client, there are no such constraints, and the result set includes an array of fields containing arrays of snippets (without separators).
Keep in mind that snippet generation options like limit, limit_words, and limit_snippets apply to each field individually by default. You can alter this behavior using the limits_per_field option, but it could lead to unwanted results. For example, one field may have matching keywords, but no snippets from that field are included in the result set because they didn't rank as high as snippets from other fields in the highlighting engine.
The highlighting algorithm currently prioritizes better snippets (with closer phrase matches) and then snippets with keywords not yet included in the result. Generally, it aims to highlight the best match for the query and to highlight all query keywords, as allowed by the limits. If no matches are found in the current field, the beginning of the document will be trimmed according to the limits and returned by default. To return an empty string instead, set the allow_empty option to 1.
Highlighting is performed during the so-called post limit stage, which means that snippet generation is deferred not only until the entire final result set is prepared but also after the LIMIT clause is applied. For instance, with a LIMIT 20,10 clause, the HIGHLIGHT() function will be called a maximum of 10 times.
There are several optional highlighting options that can be used to fine-tune snippet generation, which are common to SQL, HTTP, and PHP clients.
A string to insert before a keyword match. The %SNIPPET_ID% macro can be used in this string. The first occurrence of the macro is replaced with an incrementing snippet number within the current snippet. Numbering starts at 1 by default but can be overridden with the start_snippet_id option. %SNIPPET_ID% restarts at the beginning of each new document. The default is <b>.
A string to insert after a keyword match. The default is </b>.
The maximum snippet size, in symbols (codepoints). The default is 256. This is applied per-field by default, see limits_per_field.
Limits the maximum number of words that can be included in the result. Note that this limit applies to all words, not just the matched keywords to highlight. For example, if highlighting Mary and a snippet Mary had a little lamb is selected, it contributes 5 words to this limit, not just 1. The default is 0 (no limit). This is applied per-field by default, see limits_per_field.
Limits the maximum number of snippets that can be included in the result. The default is 0 (no limit). This is applied per-field by default, see limits_per_field.
Determines whether limit, limit_words, and limit_snippets operate as individual limits in each field of the document being highlighted or as global limits for the entire document. Setting this option to 0 means that all combined highlighting results for one document must be within the specified limits. The downside is that you may have several snippets highlighted in one field and none in another if the highlighting engine decides they are more relevant. The default is 1 (use per-field limits).
The number of words to select around each matching keyword block. The default is 5.
Determines whether to additionally break snippets by phrase boundary characters, as configured in table settings with the phrase_boundary directive. The default is 0 (don't use boundaries).
Specifies whether to sort the extracted snippets in order of relevance (decreasing weight) or in order of appearance in the document (increasing position). The default is 0 (don't use weight order).
Ignores the length limit until the result includes all keywords. The default is 0 (don't force all keywords).
Sets the starting value of the %SNIPPET_ID% macro (which is detected and expanded in before_match, after_match strings). The default is 1.
Defines the HTML stripping mode setting. Defaults to index, meaning that table settings will be used. Other values include none and strip, which forcibly skip or apply stripping regardless of table settings; and retain, which retains HTML markup and protects it from highlighting. The retain mode can only be used when highlighting full documents and therefore requires that no snippet size limits are set. The allowed string values are none, strip, index, and retain.
Permits an empty string to be returned as the highlighting result when no snippets could be generated in the current field (no keyword match or no snippets fit the limit). By default, the beginning of the original text would be returned instead of an empty string. The default is 0 (don't allow an empty result).
Ensures that snippets do not cross a sentence, paragraph, or zone boundary (when used with a table that has the respective indexing settings enabled). The allowed values are sentence, paragraph, and zone.
Emits an HTML tag with the enclosing zone name before each snippet. The default is 0 (don't emit zone names).
Determines whether to force snippet generation even if limits allow highlighting the entire text. The default is 0 (don't force snippet generation).
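As a hedged illustration combining several of the options described above (the books table follows the earlier examples; the bracket markers are arbitrary):

SELECT HIGHLIGHT({before_match='[', after_match=']', limit_words=10, allow_empty=1}) FROM books WHERE MATCH('try|said');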
SELECT HIGHLIGHT({limit=50}) FROM books WHERE MATCH('try|gets|down|said');
+---------------------------------------------------------------------------+
| highlight({limit=50}) |
+---------------------------------------------------------------------------+
| ... , "It <b>gets</b> infantile pleasure ... to knock it <b>down</b>." |
| Don`t <b>try</b> to compete in childishness, <b>said</b> Bliss. |
| ... a small room. Bander <b>said</b>, "Come, half-humans, I ... |
+---------------------------------------------------------------------------+
3 rows in set (0.00 sec)
POST /search
{
"index": "books",
"query": {"query_string": "try|gets|down|said"},
"highlight": { "limit":50 }
}
{
"took":2,
"timed_out":false,
"hits":
{
"total":3,
"hits":
[
{
"_id":"3",
"_score":1602,
"_source":
{
"title":"Book three",
"content":"Trevize whispered, \"It gets infantile pleasure out of display. I`d love to knock it down.\""
},
"highlight":
{
"title":
[
"Book three"
],
"content":
[
", \"It <b>gets</b> infantile pleasure ",
" to knock it <b>down</b>.\""
]
}
},
{
"_id":"4",
"_score":1573,
"_source":
{
"title":"Book four",
"content":"Don`t try to compete in childishness, said Bliss."
},
"highlight":
{
"title":
[
"Book four"
],
"content":
[
"Don`t <b>try</b> to compete in childishness, <b>said</b> Bliss."
]
}
},
{
"_id":"2",
"_score":1521,
"_source":
{
"title":"Book two",
"content":"A door opened before them, revealing a small room. Bander said, \"Come, half-humans, I want to show you how we live.\""
},
"highlight":
{
"title":
[
"Book two"
],
"content":
[
" a small room. Bander <b>said</b>, \"Come, half-humans, I"
]
}
}
]
}
}
$results = $index->search('try|gets|down|said')->highlight([],['limit'=>50])->get();
foreach($results as $doc)
{
echo 'Document: '.$doc->getId();
foreach($doc->getData() as $field=>$value)
{
echo $field.': '.$value;
}
foreach($doc->getHighlight() as $field=>$snippets)
{
echo "Highlight for ".$field.":\n";
foreach($snippets as $snippet)
{
echo $snippet."\n";
}
}
}
Document: 3
title: Book three
content: Trevize whispered, "It gets infantile pleasure out of display. I`d love to knock it down."
Highlight for title:
- Book three
Highlight for content:
, "It <b>gets</b> infantile pleasure
to knock it <b>down</b>."
Document: 4
title: Book four
content: Don`t try to compete in childishness, said Bliss.
Highlight for title:
- Book four
Highlight for content:
Don`t <b>try</b> to compete in childishness, <b>said</b> Bliss.
Document: 2
title: Book two
content: A door opened before them, revealing a small room. Bander said, "Come, half-humans, I want to show you how we live."
Highlight for title:
- Book two
Highlight for content:
a small room. Bander <b>said</b>, "Come, half-humans, I
res = searchApi.search({"index":"books","query":{"match":{"*":"try"}},"highlight":{"limit":50}})
{'aggregations': None,
'hits': {'hits': [{u'_id': u'4',
u'_score': 1695,
u'_source': {u'content': u'Don`t try to compete in childishness, said Bliss.',
u'title': u'Book four'},
u'highlight': {u'content': [u'Don`t <b>try</b> to compete in childishness, said Bliss.'],
u'title': [u'Book four']}}],
'max_score': None,
'total': 1},
'profile': None,
'timed_out': False,
'took': 0}
res = await searchApi.search({"index":"books","query":{"query_string":"try|gets|down|said"},"highlight":{"limit":50}});
{"took":0,"timed_out":false,"hits":{"total":3,"hits":[{"_id":"3","_score":1597,"_source":{"title":"Book three","content":"Trevize whispered, \"It gets infantile pleasure out of display. I`d love to knock it down.\""},"highlight":{"title":["Book three"],"content":[", \"It <b>gets</b> infantile pleasure "," to knock it <b>down</b>.\""]}},{"_id":"4","_score":1563,"_source":{"title":"Book four","content":"Don`t try to compete in childishness, said Bliss."},"highlight":{"title":["Book four"],"content":["Don`t <b>try</b> to compete in childishness, <b>said</b> Bliss."]}},{"_id":"5","_score":1514,"_source":{"title":"Books two","content":"A door opened before them, revealing a small room. Bander said, \"Come, half-humans, I want to show you how we live.\""},"highlight":{"title":["Books two"],"content":[" a small room. Bander <b>said</b>, \"Come, half-humans, I"]}}]}}
searchRequest = new SearchRequest();
searchRequest.setIndex("books");
query = new HashMap<String,Object>();
query.put("match",new HashMap<String,Object>(){{
put("*","try|gets|down|said");
}});
searchRequest.setQuery(query);
highlight = new HashMap<String,Object>(){{
put("limit",50);
}};
searchRequest.setHighlight(highlight);
searchResponse = searchApi.search(searchRequest);
class SearchResponse {
took: 0
timedOut: false
hits: class SearchResponseHits {
total: 3
maxScore: null
hits: [{_id=3, _score=1597, _source={title=Book three, content=Trevize whispered, "It gets infantile pleasure out of display. I`d love to knock it down."}, highlight={title=[Book three], content=[, "It <b>gets</b> infantile pleasure , to knock it <b>down</b>."]}}, {_id=4, _score=1563, _source={title=Book four, content=Don`t try to compete in childishness, said Bliss.}, highlight={title=[Book four], content=[Don`t <b>try</b> to compete in childishness, <b>said</b> Bliss.]}}, {_id=5, _score=1514, _source={title=Books two, content=A door opened before them, revealing a small room. Bander said, "Come, half-humans, I want to show you how we live."}, highlight={title=[Books two], content=[ a small room. Bander <b>said</b>, "Come, half-humans, I]}}]
aggregations: null
}
profile: null
}
var searchRequest = new SearchRequest("books");
searchRequest.FulltextFilter = new MatchFilter("*", "try|gets|down|said");
var highlight = new Highlight();
highlight.Limit = 50;
searchRequest.Highlight = highlight;
var searchResponse = searchApi.Search(searchRequest);
class SearchResponse {
took: 0
timedOut: false
hits: class SearchResponseHits {
total: 3
maxScore: null
hits: [{_id=3, _score=1597, _source={title=Book three, content=Trevize whispered, "It gets infantile pleasure out of display. I`d love to knock it down."}, highlight={title=[Book three], content=[, "It <b>gets</b> infantile pleasure , to knock it <b>down</b>."]}}, {_id=4, _score=1563, _source={title=Book four, content=Don`t try to compete in childishness, said Bliss.}, highlight={title=[Book four], content=[Don`t <b>try</b> to compete in childishness, <b>said</b> Bliss.]}}, {_id=5, _score=1514, _source={title=Books two, content=A door opened before them, revealing a small room. Bander said, "Come, half-humans, I want to show you how we live."}, highlight={title=[Books two], content=[ a small room. Bander <b>said</b>, "Come, half-humans, I]}}]
aggregations: null
}
profile: null
}
res = await searchApi.search({
index: 'test',
query: { match: { '*': 'Text' } },
highlight: { limit: 2}
});
{
"took":0,
"timed_out":false,
"hits":
{
"total":2,
"hits":
[{
"_id":"1",
"_score":1480,
"_source":
{
"content":"Text 1",
"name":"Doc 1",
"cat":1
},
"highlight":
{
"content":
[
"<b>Text 1</b>"
]
}
},
{
"_id":"2",
"_score":1480,
"_source":
{
"content":"Text 2",
"name":"Doc 2",
"cat":2
},
"highlight":
{
"content":
[
"<b>Text 2</b>"
]
}
}]
}
}
matchClause := map[string]interface{} {"*": "Text 1"};
query := map[string]interface{} {"match": matchClause};
searchRequest.SetQuery(query);
highlight := manticoreclient.NewHighlight()
searchRequest.SetHighlight(highlight)
res, _, _ := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()
{
"took":0,
"timed_out":false,
"hits":
{
"total":2,
"hits":
[{
"_id":"1",
"_score":1480,
"_source":
{
"content":"Text 1",
"name":"Doc 1",
"cat":1
},
"highlight":
{
"content":
[
"<b>Text 1</b>"
]
}
},
{
"_id":"2",
"_score":1480,
"_source":
{
"content":"Text 2",
"name":"Doc 2",
"cat":2
},
"highlight":
{
"content":
[
"<b>Text 2</b>"
]
}
}]
}
}
The HIGHLIGHT() function can be used to highlight search results. The syntax is:
HIGHLIGHT([options], [field_list], [query])
In its simplest form, it works without any arguments.
SELECT HIGHLIGHT() FROM books WHERE MATCH('before');
+-----------------------------------------------------------+
| highlight() |
+-----------------------------------------------------------+
| A door opened <b>before</b> them, revealing a small room. |
+-----------------------------------------------------------+
1 row in set (0.00 sec)
HIGHLIGHT() retrieves all available full-text fields from document storage and highlights them against the provided query. Field syntax in queries is supported. Field text is separated by field_separator, which can be modified in the options.
SELECT HIGHLIGHT() FROM books WHERE MATCH('@title one');
+-----------------+
| highlight() |
+-----------------+
| Book <b>one</b> |
+-----------------+
1 row in set (0.00 sec)
Optional first argument in HIGHLIGHT() is the list of options.
SELECT HIGHLIGHT({before_match='[match]',after_match='[/match]'}) FROM books WHERE MATCH('@title one');
+------------------------------------------------------------+
| highlight({before_match='[match]',after_match='[/match]'}) |
+------------------------------------------------------------+
| Book [match]one[/match] |
+------------------------------------------------------------+
1 row in set (0.00 sec)
The optional second argument is a string containing a single field or a comma-separated list of fields. If this argument is present, only the specified fields will be fetched from document storage and highlighted. An empty string as the second argument signifies "fetch all available fields."
SELECT HIGHLIGHT({},'title,content') FROM books WHERE MATCH('one|robots');
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| highlight({},'title,content') |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Book <b>one</b> | They followed Bander. The <b>robots</b> remained at a polite distance, but their presence was a constantly felt threat. |
| Bander ushered all three into the room. <b>One</b> of the <b>robots</b> followed as well. Bander gestured the other <b>robots</b> away and entered itself. The door closed behind it. |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)
Alternatively, you can use the second argument to specify a string attribute or field name without quotes. In this case, the supplied string will be highlighted against the provided query, but field syntax will be ignored.
SELECT HIGHLIGHT({}, title) FROM books WHERE MATCH('one');
+---------------------+
| highlight({},title) |
+---------------------+
| Book <b>one</b> |
| Book five |
+---------------------+
2 rows in set (0.00 sec)
The optional third argument is the query. This is used to highlight search results against a query different from the one used for searching.
SELECT HIGHLIGHT({},'title', 'five') FROM books WHERE MATCH('one');
+-------------------------------+
| highlight({},'title', 'five') |
+-------------------------------+
| Book one |
| Book <b>five</b> |
+-------------------------------+
2 rows in set (0.00 sec)
Although HIGHLIGHT() is designed to work with stored full-text fields and string attributes, it can also be used to highlight arbitrary text. Keep in mind that if the query contains any field search operators (e.g., @title hello @body world), the field part of them is ignored in this case.
SELECT HIGHLIGHT({},TO_STRING('some text to highlight'), 'highlight') FROM books WHERE MATCH('@title one');
+----------------------------------------------------------------+
| highlight({},TO_STRING('some text to highlight'), 'highlight') |
+----------------------------------------------------------------+
| some text to <b>highlight</b> |
+----------------------------------------------------------------+
1 row in set (0.00 sec)
Several options are relevant only when generating a single string as a result (not an array of snippets). This applies exclusively to the SQL HIGHLIGHT() function:
snippet_separator is the string to insert between snippets. The default is ' ... '.
field_separator is the string to insert between fields. The default is '|'.
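For instance, both separators can be overridden in a single SQL HIGHLIGHT() call (a minimal sketch; the separator strings here are arbitrary):
SELECT HIGHLIGHT({snippet_separator=' >>> ', field_separator=' / '}) FROM books WHERE MATCH('one|robots');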
Another way to highlight text is to use the CALL SNIPPETS statement. This mostly duplicates the HIGHLIGHT() functionality but cannot use built-in document storage. However, it can load source text from files.
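Here is a minimal sketch of the statement: the source text is passed inline, and the trailing options (given as value AS option_name pairs) are optional:
CALL SNIPPETS('some text to highlight', 'books', 'highlight', 5 AS around, 200 AS limit);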
To highlight full-text search results in JSON queries via HTTP, field contents must be stored in document storage (enabled by default). In the example below, the full-text field content is fetched from document storage and highlighted against the query specified in the query clause.
Highlighted snippets are returned in the highlight property of the hits array.
POST /search
{
"index": "books",
"query": { "match": { "*": "one|robots" } },
"highlight":
{
"fields": ["content"]
}
}
{
"took": 0,
"timed_out": false,
"hits": {
"total": 1,
"hits": [
{
"_id": "1",
"_score": 2788,
"_source": {
"title": "Books one",
"content": "They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it. "
},
"highlight": {
"content": [
"They followed Bander. The <b>robots</b> remained at a polite distance, ",
" three into the room. <b>One</b> of the <b>robots</b> followed as well. Bander",
" gestured the other <b>robots</b> away and entered itself. The"
]
}
}
]
}
}
$index->setName('books');
$results = $index->search('one|robots')->highlight(['content'])->get();
foreach($results as $doc)
{
echo 'Document: '.$doc->getId()."\n";
foreach($doc->getData() as $field=>$value)
{
echo $field.' : '.$value."\n";
}
foreach($doc->getHighlight() as $field=>$snippets)
{
echo "Highlight for ".$field.":\n";
foreach($snippets as $snippet)
{
echo "- ".$snippet."\n";
}
}
}
Document: 1
title : Books one
content : They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it.
Highlight for content:
- They followed Bander. The <b>robots</b> remained at a polite distance,
- three into the room. <b>One</b> of the <b>robots</b> followed as well. Bander
- gestured the other <b>robots</b> away and entered itself. The
res = searchApi.search({"index":"books","query":{"match":{"*":"one|robots"}},"highlight":{"fields":["content"]}}))
{'aggregations': None,
'hits': {'hits': [{u'_id': u'1',
u'_score': 2788,
u'_source': {u'content': u'They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it. ',
u'title': u'Books one'},
u'highlight': {u'content': [u'They followed Bander. The <b>robots</b> remained at a polite distance, ',
u' three into the room. <b>One</b> of the <b>robots</b> followed as well. Bander',
u' gestured the other <b>robots</b> away and entered itself. The']}}],
'max_score': None,
'total': 1},
'profile': None,
'timed_out': False,
'took': 0}
res = await searchApi.search({"index":"books","query":{"match":{"*":"one|robots"}},"highlight":{"fields":["content"]}});
{"took":0,"timed_out":false,"hits":{"total":1,"hits":[{"_id":"1","_score":2788,"_source":{"title":"Books one","content":"They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it. "},"highlight":{"content":["They followed Bander. The <b>robots</b> remained at a polite distance, "," three into the room. <b>One</b> of the <b>robots</b> followed as well. Bander"," gestured the other <b>robots</b> away and entered itself. The"]}}]}}
searchRequest = new SearchRequest();
searchRequest.setIndex("books");
query = new HashMap<String,Object>();
query.put("match",new HashMap<String,Object>(){{
put("*","one|robots");
}});
searchRequest.setQuery(query);
highlight = new HashMap<String,Object>(){{
put("fields",new String[] {"content"});
}};
searchRequest.setHighlight(highlight);
searchResponse = searchApi.search(searchRequest);
class SearchResponse {
took: 0
timedOut: false
hits: class SearchResponseHits {
total: 1
maxScore: null
hits: [{_id=1, _score=2788, _source={title=Books one, content=They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it. }, highlight={title=[Books <b>one</b>], content=[They followed Bander. The <b>robots</b> remained at a polite distance, , three into the room. <b>One</b> of the <b>robots</b> followed as well. Bander, gestured the other <b>robots</b> away and entered itself. The]}}]
aggregations: null
}
profile: null
}
var searchRequest = new SearchRequest("books");
searchRequest.FulltextFilter = new MatchFilter("*", "one|robots");
var highlight = new Highlight();
highlight.Fieldnames = new List<string> {"content"};
searchRequest.Highlight = highlight;
var searchResponse = searchApi.Search(searchRequest);
class SearchResponse {
took: 0
timedOut: false
hits: class SearchResponseHits {
total: 1
maxScore: null
hits: [{_id=1, _score=2788, _source={title=Books one, content=They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it. }, highlight={title=[Books <b>one</b>], content=[They followed Bander. The <b>robots</b> remained at a polite distance, , three into the room. <b>One</b> of the <b>robots</b> followed as well. Bander, gestured the other <b>robots</b> away and entered itself. The]}}]
aggregations: null
}
profile: null
}
res = await searchApi.search({
index: 'test',
query: {
match: {
'*': 'Text 1|Text 9'
}
},
highlight: {}
});
{
"took":0,
"timed_out":false,
"hits":
{
"total":1,
"hits":
[{
"_id":"1",
"_score":1480,
"_source":
{
"content":"Text 1",
"name":"Doc 1",
"cat":1
},
"highlight":
{
"content":
[
"<b>Text 1</b>"
]
}
}]
}
}
matchClause := map[string]interface{} {"*": "Text 1|Text 9"};
query := map[string]interface{} {"match": matchClause};
searchRequest.SetQuery(query);
highlight := manticoreclient.NewHighlight()
searchRequest.SetHighlight(highlight)
res, _, _ := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()
{
"took":0,
"timed_out":false,
"hits":
{
"total":1,
"hits":
[{
"_id":"1",
"_score":1480,
"_source":
{
"content":"Text 1",
"name":"Doc 1",
"cat":1
},
"highlight":
{
"content":
[
"<b>Text 1</b>"
]
}
}]
}
}
To highlight all possible fields, pass an empty object as the highlight property.
POST /search
{
"index": "books",
"query": { "match": { "*": "one|robots" } },
"highlight": {}
}
{
"took": 0,
"timed_out": false,
"hits": {
"total": 1,
"hits": [
{
"_id": "1",
"_score": 2788,
"_source": {
"title": "Books one",
"content": "They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it. "
},
"highlight": {
"title": [
"Books <b>one</b>"
],
"content": [
"They followed Bander. The <b>robots</b> remained at a polite distance, ",
" three into the room. <b>One</b> of the <b>robots</b> followed as well. Bander",
" gestured the other <b>robots</b> away and entered itself. The"
]
}
}
]
}
}
$index->setName('books');
$results = $index->search('one|robots')->highlight()->get();
foreach($results as $doc)
{
echo 'Document: '.$doc->getId()."\n";
foreach($doc->getData() as $field=>$value)
{
echo $field.' : '.$value."\n";
}
foreach($doc->getHighlight() as $field=>$snippets)
{
echo "Highlight for ".$field.":\n";
foreach($snippets as $snippet)
{
echo "- ".$snippet."\n";
}
}
}
Document: 1
title : Books one
content : They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it.
Highlight for title:
- Books <b>one</b>
Highlight for content:
- They followed Bander. The <b>robots</b> remained at a polite distance,
- three into the room. <b>One</b> of the <b>robots</b> followed as well. Bander
- gestured the other <b>robots</b> away and entered itself. The
res = searchApi.search({"index":"books","query":{"match":{"*":"one|robots"}},"highlight":{}})
{'aggregations': None,
'hits': {'hits': [{u'_id': u'1',
u'_score': 2788,
u'_source': {u'content': u'They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it. ',
u'title': u'Books one'},
u'highlight': {u'content': [u'They followed Bander. The <b>robots</b> remained at a polite distance, ',
u' three into the room. <b>One</b> of the <b>robots</b> followed as well. Bander',
u' gestured the other <b>robots</b> away and entered itself. The'],
u'title': [u'Books <b>one</b>']}}],
'max_score': None,
'total': 1},
'profile': None,
'timed_out': False,
'took': 0}
res = await searchApi.search({"index":"books","query":{"match":{"*":"one|robots"}},"highlight":{}});
{"took":0,"timed_out":false,"hits":{"total":1,"hits":[{"_id":"1","_score":2788,"_source":{"title":"Books one","content":"They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it. "},"highlight":{"title":["Books <b>one</b>"],"content":["They followed Bander. The <b>robots</b> remained at a polite distance, "," three into the room. <b>One</b> of the <b>robots</b> followed as well. Bander"," gestured the other <b>robots</b> away and entered itself. The"]}}]}}
searchRequest = new SearchRequest();
searchRequest.setIndex("books");
query = new HashMap<String,Object>();
query.put("match",new HashMap<String,Object>(){{
put("*","one|robots");
}});
searchRequest.setQuery(query);
highlight = new HashMap<String,Object>(){{
}};
searchRequest.setHighlight(highlight);
searchResponse = searchApi.search(searchRequest);
class SearchResponse {
took: 0
timedOut: false
hits: class SearchResponseHits {
total: 1
maxScore: null
hits: [{_id=1, _score=2788, _source={title=Books one, content=They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it. }, highlight={title=[Books <b>one</b>], content=[They followed Bander. The <b>robots</b> remained at a polite distance, , three into the room. <b>One</b> of the <b>robots</b> followed as well. Bander, gestured the other <b>robots</b> away and entered itself. The]}}]
aggregations: null
}
profile: null
}
var searchRequest = new SearchRequest("books");
searchRequest.FulltextFilter = new MatchFilter("*", "one|robots");
var highlight = new Highlight();
searchRequest.Highlight = highlight;
var searchResponse = searchApi.Search(searchRequest);
class SearchResponse {
took: 0
timedOut: false
hits: class SearchResponseHits {
total: 1
maxScore: null
hits: [{_id=1, _score=2788, _source={title=Books one, content=They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it. }, highlight={title=[Books <b>one</b>], content=[They followed Bander. The <b>robots</b> remained at a polite distance, , three into the room. <b>One</b> of the <b>robots</b> followed as well. Bander, gestured the other <b>robots</b> away and entered itself. The]}}]
aggregations: null
}
profile: null
}
res = await searchApi.search({
index: 'test',
query: {
match: {
'*': 'Text 1|Doc 1'
}
},
highlight: {}
});
{
"took":0,
"timed_out":false,
"hits":
{
"total":1,
"hits":
[{
"_id":"1",
"_score":1480,
"_source":
{
"content":"Text 1",
"name":"Doc 1",
"cat":1
},
"highlight":
{
"content":
[
"<b>Text 1</b>"
],
"name":
[
"<b>Doc 1</b>"
]
}
}]
}
}
matchClause := map[string]interface{} {"*": "Text 1|Doc 1"};
query := map[string]interface{} {"match": matchClause};
searchRequest.SetQuery(query);
highlight := manticoreclient.NewHighlight()
searchRequest.SetHighlight(highlight)
res, _, _ := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()
{
"took":0,
"timed_out":false,
"hits":
{
"total":1,
"hits":
[{
"_id":"1",
"_score":1480,
"_source":
{
"content":"Text 1",
"name":"Doc 1",
"cat":1
},
"highlight":
{
"content":
[
"<b>Text 1</b>"
],
"name":
[
"<b>Doc 1</b>"
]
}
}]
}
}
In addition to common highlighting options, several synonyms are available for JSON queries via HTTP:
The fields object contains attribute names with options. It can also be an array of field names (without any options).
Note that highlighting is performed against the full-text query by default: if you don't specify any fields to highlight, all fields are highlighted based on the full-text query, and if you do specify fields, they are highlighted only when the full-text query matches them.
The encoder can be set to default or html. When set to html, it retains HTML markup when highlighting. This works similarly to the html_strip_mode=retain option.
The highlight_query option allows you to highlight against a query other than your search query. The syntax is the same as in the main query.
POST /search
{
"index": "books",
"query": { "match": { "content": "one|robots" } },
"highlight":
{
"fields": [ "content"],
"highlight_query": { "match": { "*":"polite distance" } }
}
}
$index->setName('books');
$bool = new \Manticoresearch\Query\BoolQuery();
$bool->must(new \Manticoresearch\Query\Match(['query' => 'one|robots'], 'content'));
$results = $index->search($bool)->highlight(['content'],['highlight_query'=>['match'=>['*'=>'polite distance']]])->get();
foreach($results as $doc)
{
echo 'Document: '.$doc->getId()."\n";
foreach($doc->getData() as $field=>$value)
{
echo $field.' : '.$value."\n";
}
foreach($doc->getHighlight() as $field=>$snippets)
{
echo "Highlight for ".$field.":\n";
foreach($snippets as $snippet)
{
echo "- ".$snippet."\n";
}
}
}
res = searchApi.search({"index":"books","query":{"match":{"content":"one|robots"}},"highlight":{"fields":["content"],"highlight_query":{"match":{"*":"polite distance"}}}})
{'aggregations': None,
'hits': {'hits': [{u'_id': u'1',
u'_score': 1788,
u'_source': {u'content': u'They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it. ',
u'title': u'Books one'},
u'highlight': {u'content': [u'. The robots remained at a <b>polite distance</b>, but their presence was a']}}],
'max_score': None,
'total': 1},
'profile': None,
'timed_out': False,
'took': 0}
res = await searchApi.search({"index":"books","query":{"match":{"content":"one|robots"}},"highlight":{"fields":["content"],"highlight_query":{"match":{"*":"polite distance"}}}});
{"took":0,"timed_out":false,"hits":{"total":1,"hits":[{"_id":"1","_score":1788,"_source":{"title":"Books one","content":"They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it. "},"highlight":{"content":[". The robots remained at a <b>polite distance</b>, but their presence was a"]}}]}}
searchRequest = new SearchRequest();
searchRequest.setIndex("books");
query = new HashMap<String,Object>();
query.put("match",new HashMap<String,Object>(){{
put("*","one|robots");
}});
searchRequest.setQuery(query);
highlight = new HashMap<String,Object>(){{
put("fields",new String[] {"content","title"});
put("highlight_query",
new HashMap<String,Object>(){{
put("match", new HashMap<String,Object>(){{
put("*","polite distance");
}});
}});
}};
searchRequest.setHighlight(highlight);
searchResponse = searchApi.search(searchRequest);
var searchRequest = new SearchRequest("books");
searchRequest.FulltextFilter = new MatchFilter("*", "one|robots");
var highlight = new Highlight();
highlight.Fieldnames = new List<string> {"content", "title"};
Dictionary<string, Object> match = new Dictionary<string, Object>();
match.Add("*", "polite distance");
Dictionary<string, Object> highlightQuery = new Dictionary<string, Object>();
highlightQuery.Add("match", match);
highlight.HighlightQuery = highlightQuery;
searchRequest.Highlight = highlight;
var searchResponse = searchApi.Search(searchRequest);
res = await searchApi.search({
index: 'test',
query: {
match: {
'*': 'Text 1'
}
},
highlight: {
fields: ['content'],
highlight_query: {
match: {'*': 'Text'}
}
}
});
{
"took":0,
"timed_out":false,
"hits":
{
"total":1,
"hits":
[{
"_id":"1",
"_score":1480,
"_source":
{
"content":"Text 1",
"name":"Doc 1",
"cat":1
},
"highlight":
{
"content":
[
"<b>Text</b> 1"
]
}
}]
}
}
matchClause := map[string]interface{} {"*": "Text 1"};
query := map[string]interface{} {"match": matchClause};
searchRequest.SetQuery(query);
highlight := manticoreclient.NewHighlight()
highlightField := manticoreclient.NewHighlightField("content")
highlightFields := []interface{} { highlightField }
highlight.SetFields(highlightFields)
queryMatchClause := map[string]interface{} {"*": "Text"};
highlightQuery := map[string]interface{} {"match": queryMatchClause};
highlight.SetHighlightQuery(highlightQuery)
searchRequest.SetHighlight(highlight)
res, _, _ := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()
{
"took":0,
"timed_out":false,
"hits":
{
"total":1,
"hits":
[{
"_id":"1",
"_score":1480,
"_source":
{
"content":"Text 1",
"name":"Doc 1",
"cat":1
},
"highlight":
{
"content":
[
"<b>Text</b> 1"
]
}
}]
}
}
pre_tags and post_tags set the opening and closing tags for highlighted text snippets. They function similarly to the before_match and after_match options. These are optional, with default values of <b> and </b>.
POST /search
{
"index": "books",
"query": { "match": { "*": "one|robots" } },
"highlight":
{
"fields": [ "content", "title" ],
"pre_tags": "before_",
"post_tags": "_after"
}
}
$index->setName('books');
$bool = new \Manticoresearch\Query\BoolQuery();
$bool->must(new \Manticoresearch\Query\Match(['query' => 'one|robots'], '*'));
$results = $index->search($bool)->highlight(['content','title'],['pre_tags'=>'before_','post_tags'=>'_after'])->get();
foreach($results as $doc)
{
echo 'Document: '.$doc->getId()."\n";
foreach($doc->getData() as $field=>$value)
{
echo $field.' : '.$value."\n";
}
foreach($doc->getHighlight() as $field=>$snippets)
{
echo "Highlight for ".$field.":\n";
foreach($snippets as $snippet)
{
echo "- ".$snippet."\n";
}
}
}
Document: 1
title : Books one
content : They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it.
Highlight for content:
- They followed Bander. The before_robots_after remained at a polite distance,
- three into the room. before_One_after of the before_robots_after followed as well. Bander
- gestured the other before_robots_after away and entered itself. The
Highlight for title:
- Books before_one_after
res = searchApi.search({"index":"books","query":{"match":{"*":"one|robots"}},"highlight":{"fields":["content","title"],"pre_tags":"before_","post_tags":"_after"}})
{'aggregations': None,
'hits': {'hits': [{u'_id': u'1',
u'_score': 2788,
u'_source': {u'content': u'They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it. ',
u'title': u'Books one'},
u'highlight': {u'content': [u'They followed Bander. The before_robots_after remained at a polite distance, ',
u' three into the room. before_One_after of the before_robots_after followed as well. Bander',
u' gestured the other before_robots_after away and entered itself. The'],
u'title': [u'Books before_one_after']}}],
'max_score': None,
'total': 1},
'profile': None,
'timed_out': False,
'took': 0}
res = await searchApi.search({"index":"books","query":{"match":{"*":"one|robots"}},"highlight":{"fields":["content","title"],"pre_tags":"before_","post_tags":"_after"}});
{"took":0,"timed_out":false,"hits":{"total":1,"hits":[{"_id":"1","_score":2788,"_source":{"title":"Books one","content":"They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it. "},"highlight":{"content":["They followed Bander. The before_robots_after remained at a polite distance, "," three into the room. before_One_after of the before_robots_after followed as well. Bander"," gestured the other before_robots_after away and entered itself. The"],"title":["Books before_one_after"]}}]}}
searchRequest = new SearchRequest();
searchRequest.setIndex("books");
query = new HashMap<String,Object>();
query.put("match",new HashMap<String,Object>(){{
put("*","one|robots");
}});
searchRequest.setQuery(query);
highlight = new HashMap<String,Object>(){{
put("fields",new String[] {"content","title"});
put("pre_tags","before_");
put("post_tags","_after");
}};
searchRequest.setHighlight(highlight);
searchResponse = searchApi.search(searchRequest);
var searchRequest = new SearchRequest("books");
searchRequest.FulltextFilter = new MatchFilter("*", "one|robots");
var highlight = new Highlight();
highlight.Fieldnames = new List<string> {"content", "title"};
highlight.PreTags = "before_";
highlight.PostTags = "_after";
searchRequest.Highlight = highlight;
var searchResponse = searchApi.Search(searchRequest);
res = await searchApi.search({
index: 'test',
query: {
match: {
'*': 'Text 1'
}
},
highlight: {
pre_tags: 'before_',
post_tags: '_after'
}
});
{
"took":0,
"timed_out":false,
"hits":
{
"total":1,
"hits":
[{
"_id":"1",
"_score":1480,
"_source":
{
"content":"Text 1",
"name":"Doc 1",
"cat":1
},
"highlight":
{
"content":
[
"before_Text 1_after"
]
}
}]
}
}
matchClause := map[string]interface{} {"*": "Text 1"}
query := map[string]interface{} {"match": matchClause}
searchRequest.SetQuery(query)
highlight := manticoreclient.NewHighlight()
highlight.SetPreTags("before_")
highlight.SetPostTags("_after")
searchRequest.SetHighlight(highlight)
res, _, _ := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()
{
"took":0,
"timed_out":false,
"hits":
{
"total":1,
"hits":
[{
"_id":"1",
"_score":1480,
"_source":
{
"content":"Text 1",
"name":"Doc 1",
"cat":1
},
"highlight":
{
"content":
[
"before_Text 1_after"
]
}
}]
}
}
no_match_size functions similarly to the allow_empty option. If set to 0, it acts as allow_empty=1, allowing an empty string to be returned as a highlighting result when a snippet could not be generated. Otherwise, the beginning of the field will be returned. This is optional, with a default value of 1.
POST /search
{
"index": "books",
"query": { "match": { "*": "one|robots" } },
"highlight":
{
"fields": [ "content", "title" ],
"no_match_size": 0
}
}
$index->setName('books');
$bool = new \Manticoresearch\Query\BoolQuery();
$bool->must(new \Manticoresearch\Query\Match(['query' => 'one|robots'], '*'));
$results = $index->search($bool)->highlight(['content','title'],['no_match_size'=>0])->get();
foreach($results as $doc)
{
echo 'Document: '.$doc->getId()."\n";
foreach($doc->getData() as $field=>$value)
{
echo $field.' : '.$value."\n";
}
foreach($doc->getHighlight() as $field=>$snippets)
{
echo "Highlight for ".$field.":\n";
foreach($snippets as $snippet)
{
echo "- ".$snippet."\n";
}
}
}
Document: 1
title : Books one
content : They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it.
Highlight for content:
- They followed Bander. The <b>robots</b> remained at a polite distance,
- three into the room. <b>One</b> of the <b>robots</b> followed as well. Bander
- gestured the other <b>robots</b> away and entered itself. The
Highlight for title:
- Books <b>one</b>
res = searchApi.search({"index":"books","query":{"match":{"*":"one|robots"}},"highlight":{"fields":["content","title"],"no_match_size":0}})
{'aggregations': None,
'hits': {'hits': [{u'_id': u'1',
u'_score': 2788,
u'_source': {u'content': u'They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it. ',
u'title': u'Books one'},
u'highlight': {u'content': [u'They followed Bander. The <b>robots</b> remained at a polite distance, ',
u' three into the room. <b>One</b> of the <b>robots</b> followed as well. Bander',
u' gestured the other <b>robots</b> away and entered itself. The'],
u'title': [u'Books <b>one</b>']}}],
'max_score': None,
'total': 1},
'profile': None,
'timed_out': False,
'took': 0}
res = await searchApi.search({"index":"books","query":{"match":{"*":"one|robots"}},"highlight":{"fields":["content","title"],"no_match_size":0}});
{"took":0,"timed_out":false,"hits":{"total":1,"hits":[{"_id":"1","_score":2788,"_source":{"title":"Books one","content":"They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it. "},"highlight":{"content":["They followed Bander. The <b>robots</b> remained at a polite distance, "," three into the room. <b>One</b> of the <b>robots</b> followed as well. Bander"," gestured the other <b>robots</b> away and entered itself. The"],"title":["Books <b>one</b>"]}}]}}
searchRequest = new SearchRequest();
searchRequest.setIndex("books");
query = new HashMap<String,Object>();
query.put("match",new HashMap<String,Object>(){{
put("*","one|robots");
}});
searchRequest.setQuery(query);
highlight = new HashMap<String,Object>(){{
put("fields",new String[] {"content","title"});
put("no_match_size",0);
}};
searchRequest.setHighlight(highlight);
searchResponse = searchApi.search(searchRequest);
var searchRequest = new SearchRequest("books");
searchRequest.FulltextFilter = new MatchFilter("*", "one|robots");
var highlight = new Highlight();
highlight.Fieldnames = new List<string> {"content", "title"};
highlight.NoMatchSize = 0;
searchRequest.Highlight = highlight;
var searchResponse = searchApi.Search(searchRequest);
res = await searchApi.search({
index: 'test',
query: {
match: {
'*': 'Text 1'
}
},
highlight: {no_match_size: 0}
});
{
"took":0,
"timed_out":false,
"hits":
{
"total":1,
"hits":
[{
"_id":"1",
"_score":1480,
"_source":
{
"content":"Text 1",
"name":"Doc 1",
"cat":1
},
"highlight":
{
"content":
[
"<b>Text 1</b>"
]
}
}]
}
}
matchClause := map[string]interface{} {"*": "Text 1"};
query := map[string]interface{} {"match": matchClause};
searchRequest.SetQuery(query);
highlight := manticoreclient.NewHighlight()
highlight.SetNoMatchSize(0)
searchRequest.SetHighlight(highlight)
res, _, _ := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()
{
"took":0,
"timed_out":false,
"hits":
{
"total":1,
"hits":
[{
"_id":"1",
"_score":1480,
"_source":
{
"content":"Text 1",
"name":"Doc 1",
"cat":1
},
"highlight":
{
"content":
[
"<b>Text 1</b>"
]
}
}]
}
}
order sets the sorting order of extracted snippets. If set to "score", it sorts the extracted snippets in order of relevance. This is optional and works similarly to the weight_order option.
POST /search
{
"index": "books",
"query": { "match": { "*": "one|robots" } },
"highlight":
{
"fields": [ "content", "title" ],
"order": "score"
}
}
$index->setName('books');
$bool = new \Manticoresearch\Query\BoolQuery();
$bool->must(new \Manticoresearch\Query\Match(['query' => 'one|robots'], '*'));
$results = $index->search($bool)->highlight(['content','title'],['order'=>"score"])->get();
foreach($results as $doc)
{
echo 'Document: '.$doc->getId()."\n";
foreach($doc->getData() as $field=>$value)
{
echo $field.' : '.$value."\n";
}
foreach($doc->getHighlight() as $field=>$snippets)
{
echo "Highlight for ".$field.":\n";
foreach($snippets as $snippet)
{
echo "- ".$snippet."\n";
}
}
}
Document: 1
title : Books one
content : They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it.
Highlight for content:
- three into the room. <b>One</b> of the <b>robots</b> followed as well. Bander
- gestured the other <b>robots</b> away and entered itself. The
- They followed Bander. The <b>robots</b> remained at a polite distance,
Highlight for title:
- Books <b>one</b>
res = searchApi.search({"index":"books","query":{"match":{"*":"one|robots"}},"highlight":{"fields":["content","title"],"order":"score"}})
{'aggregations': None,
'hits': {'hits': [{u'_id': u'1',
u'_score': 2788,
u'_source': {u'content': u'They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it. ',
u'title': u'Books one'},
u'highlight': {u'content': [u' three into the room. <b>One</b> of the <b>robots</b> followed as well. Bander',
u' gestured the other <b>robots</b> away and entered itself. The',
u'They followed Bander. The <b>robots</b> remained at a polite distance, '],
u'title': [u'Books <b>one</b>']}}],
'max_score': None,
'total': 1},
'profile': None,
'timed_out': False,
'took': 0}
res = await searchApi.search({"index":"books","query":{"match":{"*":"one|robots"}},"highlight":{"fields":["content","title"],"order":"score"}});
{"took":0,"timed_out":false,"hits":{"total":1,"hits":[{"_id":"1","_score":2788,"_source":{"title":"Books one","content":"They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it. "},"highlight":{"content":[" three into the room. <b>One</b> of the <b>robots</b> followed as well. Bander"," gestured the other <b>robots</b> away and entered itself. The","They followed Bander. The <b>robots</b> remained at a polite distance, "],"title":["Books <b>one</b>"]}}]}}
searchRequest = new SearchRequest();
searchRequest.setIndex("books");
query = new HashMap<String,Object>();
query.put("match",new HashMap<String,Object>(){{
put("*","one|robots");
}});
searchRequest.setQuery(query);
highlight = new HashMap<String,Object>(){{
put("fields",new String[] {"content","title"});
put("order","score");
}};
searchRequest.setHighlight(highlight);
searchResponse = searchApi.search(searchRequest);
var searchRequest = new SearchRequest("books");
searchRequest.FulltextFilter = new MatchFilter("*", "one|robots");
var highlight = new Highlight();
highlight.Fieldnames = new List<string> {"content", "title"};
highlight.Order = "score";
searchRequest.Highlight = highlight;
var searchResponse = searchApi.Search(searchRequest);
res = await searchApi.search({
index: 'test',
query: {
match: {
'*': 'Text 1'
}
},
highlight: { order: 'score' }
});
{
"took":0,
"timed_out":false,
"hits":
{
"total":1,
"hits":
[{
"_id":"1",
"_score":1480,
"_source":
{
"content":"Text 1",
"name":"Doc 1",
"cat":1
},
"highlight":
{
"content":
[
"<b>Text 1</b>"
]
}
}]
}
}
matchClause := map[string]interface{} {"*": "Text 1"};
query := map[string]interface{} {"match": matchClause};
searchRequest.SetQuery(query);
highlight := manticoreclient.NewHighlight()
highlight.SetOrder("score")
searchRequest.SetHighlight(highlight)
res, _, _ := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()
{
"took":0,
"timed_out":false,
"hits":
{
"total":1,
"hits":
[{
"_id":"1",
"_score":1480,
"_source":
{
"content":"Text 1",
"name":"Doc 1",
"cat":1
},
"highlight":
{
"content":
[
"<b>Text 1</b>"
]
}
}]
}
}
fragment_size sets the maximum snippet size in symbols. It can be global or per-field. Per-field options override global options. This is optional, with a default value of 256. It works similarly to the limit option.
POST /search
{
"index": "books",
"query": { "match": { "*": "one|robots" } },
"highlight":
{
"fields": [ "content", "title" ],
"fragment_size": 100
}
}
$index->setName('books');
$bool = new \Manticoresearch\Query\BoolQuery();
$bool->must(new \Manticoresearch\Query\Match(['query' => 'one|robots'], '*'));
$results = $index->search($bool)->highlight(['content','title'],['fragment_size'=>100])->get();
foreach($results as $doc)
{
echo 'Document: '.$doc->getId()."\n";
foreach($doc->getData() as $field=>$value)
{
echo $field.' : '.$value."\n";
}
foreach($doc->getHighlight() as $field=>$snippets)
{
echo "Highlight for ".$field.":\n";
foreach($snippets as $snippet)
{
echo "- ".$snippet."\n";
}
}
}
Document: 1
title : Books one
content : They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it.
Highlight for content:
- the room. <b>One</b> of the <b>robots</b> followed as well
- Bander gestured the other <b>robots</b> away and entered
Highlight for title:
- Books <b>one</b>
res = searchApi.search({"index":"books","query":{"match":{"*":"one|robots"}},"highlight":{"fields":["content","title"],"fragment_size":100}})
{'aggregations': None,
'hits': {'hits': [{u'_id': u'1',
u'_score': 2788,
u'_source': {u'content': u'They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it. ',
u'title': u'Books one'},
u'highlight': {u'content': [u' the room. <b>One</b> of the <b>robots</b> followed as well',
u'Bander gestured the other <b>robots</b> away and entered '],
u'title': [u'Books <b>one</b>']}}],
'max_score': None,
'total': 1},
'profile': None,
'timed_out': False,
'took': 0}
res = await searchApi.search({"index":"books","query":{"match":{"*":"one|robots"}},"highlight":{"fields":["content","title"],"fragment_size":100}});
{"took":0,"timed_out":false,"hits":{"total":1,"hits":[{"_id":"1","_score":2788,"_source":{"title":"Books one","content":"They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it. "},"highlight":{"content":[" the room. <b>One</b> of the <b>robots</b> followed as well","Bander gestured the other <b>robots</b> away and entered "],"title":["Books <b>one</b>"]}}]}}
searchRequest = new SearchRequest();
searchRequest.setIndex("books");
query = new HashMap<String,Object>();
query.put("match",new HashMap<String,Object>(){{
put("*","one|robots");
}});
searchRequest.setQuery(query);
highlight = new HashMap<String,Object>(){{
put("fields",new String[] {"content","title"});
put("fragment_size",100);
}};
searchRequest.setHighlight(highlight);
searchResponse = searchApi.search(searchRequest);
var searchRequest = new SearchRequest("books");
searchRequest.FulltextFilter = new MatchFilter("*", "one|robots");
var highlight = new Highlight();
highlight.Fieldnames = new List<string> {"content", "title"};
highlight.FragmentSize = 100;
searchRequest.Highlight = highlight;
var searchResponse = searchApi.Search(searchRequest);
res = await searchApi.search({
index: 'test',
query: {
match: {
'*': 'Text 1'
}
},
highlight: { fragment_size: 4}
});
{
"took":0,
"timed_out":false,
"hits":
{
"total":1,
"hits":
[{
"_id":"1",
"_score":1480,
"_source":
{
"content":"Text 1",
"name":"Doc 1",
"cat":1
},
"highlight":
{
"content":
[
"<b>Text</b>"
]
}
}]
}
}
matchClause := map[string]interface{} {"*": "Text 1"};
query := map[string]interface{} {"match": matchClause};
searchRequest.SetQuery(query);
highlight := manticoreclient.NewHighlight()
highlight.SetFragmentSize(4)
searchRequest.SetHighlight(highlight)
res, _, _ := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()
{
"took":0,
"timed_out":false,
"hits":
{
"total":1,
"hits":
[{
"_id":"1",
"_score":1480,
"_source":
{
"content":"Text 1",
"name":"Doc 1",
"cat":1
},
"highlight":
{
"content":
[
"<b>Text</b>"
]
}
}]
}
}
number_of_fragments limits the maximum number of snippets in the result. Like fragment_size, it can be global or per-field. This is optional, with a default value of 0 (no limit). It works similarly to the limit_snippets option.
POST /search
{
"index": "books",
"query": { "match": { "*": "one|robots" } },
"highlight":
{
"fields": [ "content", "title" ],
"number_of_fragments": 10
}
}
$index->setName('books');
$bool = new \Manticoresearch\Query\BoolQuery();
$bool->must(new \Manticoresearch\Query\Match(['query' => 'one|robots'], '*'));
$results = $index->search($bool)->highlight(['content','title'],['number_of_fragments'=>10])->get();
foreach($results as $doc)
{
echo 'Document: '.$doc->getId()."\n";
foreach($doc->getData() as $field=>$value)
{
echo $field.' : '.$value."\n";
}
foreach($doc->getHighlight() as $field=>$snippets)
{
echo "Highlight for ".$field.":\n";
foreach($snippets as $snippet)
{
echo "- ".$snippet."\n";
}
}
}
Document: 1
title : Books one
content : They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it.
Highlight for content:
- They followed Bander. The <b>robots</b> remained at a polite distance,
- three into the room. <b>One</b> of the <b>robots</b> followed as well. Bander
- gestured the other <b>robots</b> away and entered itself. The
Highlight for title:
- Books <b>one</b>
res = searchApi.search({"index":"books","query":{"match":{"*":"one|robots"}},"highlight":{"fields":["content","title"],"number_of_fragments":10}})
{'aggregations': None,
'hits': {'hits': [{u'_id': u'1',
u'_score': 2788,
u'_source': {u'content': u'They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it. ',
u'title': u'Books one'},
u'highlight': {u'content': [u'They followed Bander. The <b>robots</b> remained at a polite distance, ',
u' three into the room. <b>One</b> of the <b>robots</b> followed as well. Bander',
u' gestured the other <b>robots</b> away and entered itself. The'],
u'title': [u'Books <b>one</b>']}}],
'max_score': None,
'total': 1},
'profile': None,
'timed_out': False,
'took': 0}
res = await searchApi.search({"index":"books","query":{"match":{"*":"one|robots"}},"highlight":{"fields":["content","title"],"number_of_fragments":10}});
{"took":0,"timed_out":false,"hits":{"total":1,"hits":[{"_id":"1","_score":2788,"_source":{"title":"Books one","content":"They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it. "},"highlight":{"content":["They followed Bander. The <b>robots</b> remained at a polite distance, "," three into the room. <b>One</b> of the <b>robots</b> followed as well. Bander"," gestured the other <b>robots</b> away and entered itself. The"],"title":["Books <b>one</b>"]}}]}}
searchRequest = new SearchRequest();
searchRequest.setIndex("books");
query = new HashMap<String,Object>();
query.put("match",new HashMap<String,Object>(){{
put("*","one|robots");
}});
searchRequest.setQuery(query);
highlight = new HashMap<String,Object>(){{
put("fields",new String[] {"content","title"});
put("number_of_fragments",10);
}};
searchRequest.setHighlight(highlight);
searchResponse = searchApi.search(searchRequest);
var searchRequest = new SearchRequest("books");
searchRequest.FulltextFilter = new MatchFilter("*", "one|robots");
var highlight = new Highlight();
highlight.Fieldnames = new List<string> {"content", "title"};
highlight.NumberOfFragments = 10;
searchRequest.Highlight = highlight;
var searchResponse = searchApi.Search(searchRequest);
res = await searchApi.search({
index: 'test',
query: {
match: {
'*': 'Text 1'
}
},
highlight: { number_of_fragments: 1}
});
{
"took":0,
"timed_out":false,
"hits":
{
"total":1,
"hits":
[{
"_id":"1",
"_score":1480,
"_source":
{
"content":"Text 1",
"name":"Doc 1",
"cat":1
},
"highlight":
{
"content":
[
"<b>Text 1</b>"
]
}
}]
}
}
matchClause := map[string]interface{} {"*": "Text 1"};
query := map[string]interface{} {"match": matchClause};
searchRequest.SetQuery(query);
highlight := manticoreclient.NewHighlight()
highlight.SetNumberOfFragments(1)
searchRequest.SetHighlight(highlight)
res, _, _ := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()
{
"took":0,
"timed_out":false,
"hits":
{
"total":1,
"hits":
[{
"_id":"1",
"_score":1480,
"_source":
{
"content":"Text 1",
"name":"Doc 1",
"cat":1
},
"highlight":
{
"content":
[
"<b>Text 1</b>"
]
}
}]
}
}
Options like limit, limit_words, and limit_snippets can be set as global or per-field options. Global options are used as per-field limits unless per-field options override them. In the example, the title field is highlighted with default limit settings, while the content field uses a different limit.
POST /search
{
"index": "books",
"query": { "match": { "*": "one|robots" } },
"highlight":
{
"fields":
{
"title": {},
"content" : { "limit": 50 }
}
}
}
$index->setName('books');
$bool = new \Manticoresearch\Query\BoolQuery();
$bool->must(new \Manticoresearch\Query\Match(['query' => 'one|robots'], '*'));
$results = $index->search($bool)->highlight(['content'=>['limit'=>50],'title'=>new \stdClass])->get();
foreach($results as $doc)
{
echo 'Document: '.$doc->getId()."\n";
foreach($doc->getData() as $field=>$value)
{
echo $field.' : '.$value."\n";
}
foreach($doc->getHighlight() as $field=>$snippets)
{
echo "Highlight for ".$field.":\n";
foreach($snippets as $snippet)
{
echo "- ".$snippet."\n";
}
}
}
Document: 1
title : Books one
content : They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it.
Highlight for content:
- into the room. <b>One</b> of the <b>robots</b> followed as well
Highlight for title:
- Books <b>one</b>
res = searchApi.search({"index":"books","query":{"match":{"*":"one|robots"}},"highlight":{"fields":{"title":{},"content":{"limit":50}}}})
{'aggregations': None,
'hits': {'hits': [{u'_id': u'1',
u'_score': 2788,
u'_source': {u'content': u'They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it. ',
u'title': u'Books one'},
u'highlight': {u'content': [u' into the room. <b>One</b> of the <b>robots</b> followed as well'],
u'title': [u'Books <b>one</b>']}}],
'max_score': None,
'total': 1},
'profile': None,
'timed_out': False,
'took': 0}
res = await searchApi.search({"index":"books","query":{"match":{"*":"one|robots"}},"highlight":{"fields":{"title":{},"content":{"limit":50}}}});
{"took":0,"timed_out":false,"hits":{"total":1,"hits":[{"_id":"1","_score":2788,"_source":{"title":"Books one","content":"They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it. "},"highlight":{"title":["Books <b>one</b>"],"content":[" into the room. <b>One</b> of the <b>robots</b> followed as well"]}}]}}
searchRequest = new SearchRequest();
searchRequest.setIndex("books");
query = new HashMap<String,Object>();
query.put("match",new HashMap<String,Object>(){{
put("*","one|robots");
}});
searchRequest.setQuery(query);
highlight = new HashMap<String,Object>(){{
put("fields",new HashMap<String,Object>(){{
put("title",new HashMap<String,Object>(){{}});
put("content",new HashMap<String,Object>(){{
put("limit",50);
}});
}}
);
}};
searchRequest.setHighlight(highlight);
searchResponse = searchApi.search(searchRequest);
var searchRequest = new SearchRequest("books");
searchRequest.FulltextFilter = new MatchFilter("*", "one|robots");
var highlight = new Highlight();
var highlightField = new HighlightField("content");
highlightField.Limit = 50;
highlight.Fields = new List<Object> {new HighlightField("title"), highlightField};
searchRequest.Highlight = highlight;
var searchResponse = searchApi.Search(searchRequest);
res = await searchApi.search({
index: 'test',
query: {
match: {
'*': 'Text 1'
}
},
highlight: {
fields: {
content: { limit:1 }
}
}
});
{
"took":0,
"timed_out":false,
"hits":
{
"total":1,
"hits":
[{
"_id":"1",
"_score":1480,
"_source":
{
"content":"Text 1",
"name":"Doc 1",
"cat":1
},
"highlight":
{
"content":
[
"<b>Text</b>"
]
}
}]
}
}
matchClause := map[string]interface{} {"*": "Text 1"};
query := map[string]interface{} {"match": matchClause};
searchRequest.SetQuery(query);
highlight := manticoreclient.NewHighlight()
highlightField := manticoreclient.NewHighlightField("content")
highlightField.SetLimit(1);
highlightFields := []interface{} { highlightField }
highlight.SetFields(highlightFields)
searchRequest.SetHighlight(highlight)
res, _, _ := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()
{
"took":0,
"timed_out":false,
"hits":
{
"total":1,
"hits":
[{
"_id":"1",
"_score":1480,
"_source":
{
"content":"Text 1",
"name":"Doc 1",
"cat":1
},
"highlight":
{
"content":
[
"<b>Text</b>"
]
}
}]
}
}
Global limits can also be enforced by specifying limits_per_field=0. Setting this option means that all combined highlighting results must be within the specified limits. The downside is that you may get several snippets highlighted in one field and none in another if the highlighting engine decides that they are more relevant.
POST /search
{
"index": "books",
"query": { "match": { "content": "and first" } },
"highlight":
{
"limits_per_field": false,
"fields":
{
"content" : { "limit": 50 }
}
}
}
$index->setName('books');
$bool = new \Manticoresearch\Query\BoolQuery();
$bool->must(new \Manticoresearch\Query\Match(['query' => 'and first'], 'content'));
$results = $index->search($bool)->highlight(['content'=>['limit'=>50]],['limits_per_field'=>false])->get();
foreach($results as $doc)
{
echo 'Document: '.$doc->getId()."\n";
foreach($doc->getData() as $field=>$value)
{
echo $field.' : '.$value."\n";
}
foreach($doc->getHighlight() as $field=>$snippets)
{
echo "Highlight for ".$field.":\n";
foreach($snippets as $snippet)
{
echo "- ".$snippet."\n";
}
}
}
Document: 1
title : Books one
content : They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it.
Highlight for content:
- gestured the other robots away <b>and</b> entered itself. The door closed
res = searchApi.search({"index":"books","query":{"match":{"content":"and first"}},"highlight":{"fields":{"content":{"limit":50}},"limits_per_field":False}})
{'aggregations': None,
'hits': {'hits': [{u'_id': u'1',
u'_score': 1597,
u'_source': {u'content': u'They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it. ',
u'title': u'Books one'},
u'highlight': {u'content': [u' gestured the other robots away <b>and</b> entered itself. The door closed']}}],
'max_score': None,
'total': 1},
'profile': None,
'timed_out': False,
'took': 0}
res = await searchApi.search({"index":"books","query":{"match":{"content":"and first"}},"highlight":{"fields":{"content":{"limit":50}},"limits_per_field":false}});
{"took":0,"timed_out":false,"hits":{"total":1,"hits":[{"_id":"1","_score":1597,"_source":{"title":"Books one","content":"They followed Bander. The robots remained at a polite distance, but their presence was a constantly felt threat. Bander ushered all three into the room. One of the robots followed as well. Bander gestured the other robots away and entered itself. The door closed behind it. "},"highlight":{"content":[" gestured the other robots away <b>and</b> entered itself. The door closed"]}}]}}
searchRequest = new SearchRequest();
searchRequest.setIndex("books");
query = new HashMap<String,Object>();
query.put("match",new HashMap<String,Object>(){{
put("*","one|robots");
}});
searchRequest.setQuery(query);
highlight = new HashMap<String,Object>(){{
put("limits_per_field",0);
put("fields",new HashMap<String,Object>(){{
put("content",new HashMap<String,Object>(){{
put("limit",50);
}});
}}
);
}};
searchRequest.setHighlight(highlight);
searchResponse = searchApi.search(searchRequest);
var searchRequest = new SearchRequest("books");
searchRequest.FulltextFilter = new MatchFilter("*", "one|robots");
var highlight = new Highlight();
highlight.LimitsPerField = 0;
var highlightField = new HighlightField("title");
highlight.Fields = new List<Object> {highlightField};
searchRequest.Highlight = highlight;
var searchResponse = searchApi.Search(searchRequest);
res = await searchApi.search({
index: 'test',
query: {
match: {
'*': 'Text 1'
}
},
highlight: { limits_per_field: 0 }
});
matchClause := map[string]interface{} {"*": "Text 1"};
query := map[string]interface{} {"match": matchClause};
searchRequest.SetQuery(query);
highlight := manticoreclient.NewHighlight()
highlight.SetLimitsPerField(0)
searchRequest.SetHighlight(highlight)
res, _, _ := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()
The CALL SNIPPETS statement builds a snippet from provided data and query using specified table settings. It can't access built-in document storage, which is why it's recommended to use the HIGHLIGHT() function instead.
The syntax is:
CALL SNIPPETS(data, table, query[, opt_value AS opt_name[, ...]])
data serves as the source from which a snippet is extracted. It can either be a single string or a list of strings enclosed in curly brackets.
table refers to the name of the table that provides the text processing settings for snippet generation.
query is the full-text query used to build the snippets.
opt_value and opt_name represent the snippet generation options.
CALL SNIPPETS(('this is my document text','this is my another text'), 'forum', 'is text', 5 AS around, 200 AS limit);
+----------------------------------------+
| snippet |
+----------------------------------------+
| this <b>is</b> my document <b>text</b> |
| this <b>is</b> my another <b>text</b> |
+----------------------------------------+
2 rows in set (0.02 sec)
Most options are the same as in the HIGHLIGHT() function. There are, however, several options that can only be used with CALL SNIPPETS.
The following options can be used to highlight text stored in separate files:
load_files: When enabled, this option treats the first argument as file names rather than as data to extract snippets from. The specified files will be loaded on the server side. Up to max_threads_per_query worker threads per request will be used to parallelize the work when this flag is enabled. Default is 0 (no limit). To distribute snippet generation between remote agents, invoke snippet generation on a distributed table that contains only one(!) local agent and several remote ones. The snippets_file_prefix option is used to generate the final file name. For example, when searchd is configured with snippets_file_prefix = /var/data_ and text.txt is provided as a file name, snippets will be generated from the content of /var/data_text.txt.
load_files_scattered: This option only works with distributed snippet generation with remote agents. Source files for snippet generation can be distributed among different agents, and the main server will merge all non-erroneous results. For example, if one agent of the distributed table has file1.txt, another agent has file2.txt, and you use CALL SNIPPETS with both of these files, searchd will merge the agent results, so you will get results from both file1.txt and file2.txt. Default is 0.
If the load_files option is also enabled, the request will return an error if any of the files is not available anywhere. Otherwise (if load_files is not enabled), it will return empty strings for all absent files. Searchd does not pass this flag to agents, so agents do not generate a critical error if the file does not exist. If you want to be sure that all source files are loaded, set both load_files_scattered and load_files to 1. If the absence of some source files on some agent is not critical, set only load_files_scattered to 1.
CALL SNIPPETS(('data/doc1.txt','data/doc2.txt'), 'forum', 'is text', 1 AS load_files);
+----------------------------------------+
| snippet |
+----------------------------------------+
| this <b>is</b> my document <b>text</b> |
| this <b>is</b> my another <b>text</b> |
+----------------------------------------+
2 rows in set (0.02 sec)
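To be sure that all scattered source files get loaded, you can combine both flags, as described above. Here's a minimal sketch; the file names and the distributed table name dist_forum are hypothetical:
CALL SNIPPETS(('file1.txt','file2.txt'), 'dist_forum', 'is text', 1 AS load_files, 1 AS load_files_scattered);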
Query results can be sorted by full-text ranking weight, one or more attributes or expressions.
Full-text queries return matches sorted by default. If nothing is specified, they are sorted by relevance, which is equivalent to ORDER BY weight() DESC in SQL format.
Non-full-text queries do not perform any sorting by default.
Extended mode is automatically enabled when you explicitly provide sorting rules by adding the ORDER BY clause in SQL format or using the sort option via HTTP JSON.
General syntax:
SELECT ... ORDER BY
{attribute_name | expr_alias | weight() | random() } [ASC | DESC],
...
{attribute_name | expr_alias | weight() | random() } [ASC | DESC]
In the sort clause, you can use any combination of up to 5 columns, each followed by asc or desc. Functions and expressions are not allowed as arguments for the sort clause, except for the weight() and random() functions (the latter can only be used via SQL in the form of ORDER BY random()). However, you can use any expression in the SELECT list and sort by its alias.
select *, a + b alias from test order by alias desc;
+------+------+------+----------+-------+
| id | a | b | f | alias |
+------+------+------+----------+-------+
| 1 | 2 | 3 | document | 5 |
+------+------+------+----------+-------+
"sort" specifies an array where each element can be an attribute name or _score if you want to sort by match weights. In that case, the sort order defaults to ascending for attributes and descending for _score.
{
"index":"test",
"query":
{
"match": { "title": "Test document" }
},
"sort": [ "_score", "id" ],
"_source": "title",
"limit": 3
}
{
"took": 0,
"timed_out": false,
"hits": {
"total": 5,
"total_relation": "eq",
"hits": [
{
"_id": "5406864699109146628",
"_score": 2319,
"_source": {
"title": "Test document 1"
}
},
{
"_id": "5406864699109146629",
"_score": 2319,
"_source": {
"title": "Test document 2"
}
},
{
"_id": "5406864699109146630",
"_score": 2319,
"_source": {
"title": "Test document 3"
}
}
]
}
}
$search->setIndex("test")->match('Test document')->sort('_score')->sort('id');
search_request.index = 'test'
search_request.fulltext_filter = manticoresearch.model.QueryFilter('Test document')
search_request.sort = ['_score', 'id']
searchRequest.index = "test";
searchRequest.fulltext_filter = new Manticoresearch.QueryFilter('Test document');
searchRequest.sort = ['_score', 'id'];
searchRequest.setIndex("test");
QueryFilter queryFilter = new QueryFilter();
queryFilter.setQueryString("Test document");
searchRequest.setFulltextFilter(queryFilter);
List<Object> sort = new ArrayList<Object>( Arrays.asList("_score", "id") );
searchRequest.setSort(sort);
var searchRequest = new SearchRequest("test");
searchRequest.FulltextFilter = new QueryFilter("Test document");
searchRequest.Sort = new List<Object> {"_score", "id"};
searchRequest = {
index: 'test',
query: {
query_string: 'Test document',
},
sort: ['_score', 'id'],
}
searchRequest.SetIndex("test")
query := map[string]interface{} {"query_string": "Test document"}
searchRequest.SetQuery(query)
sort := map[string]interface{} {"_score": "asc", "id": "asc"}
searchRequest.SetSort(sort)
You can also specify the sort order explicitly:
- asc: sort in ascending order
- desc: sort in descending order
{
"index":"test",
"query":
{
"match": { "title": "Test document" }
},
"sort":
[
{ "id": "desc" },
"_score"
],
"_source": "title",
"limit": 3
}
{
"took": 0,
"timed_out": false,
"hits": {
"total": 5,
"total_relation": "eq",
"hits": [
{
"_id": "5406864699109146632",
"_score": 2319,
"_source": {
"title": "Test document 5"
}
},
{
"_id": "5406864699109146631",
"_score": 2319,
"_source": {
"title": "Test document 4"
}
},
{
"_id": "5406864699109146630",
"_score": 2319,
"_source": {
"title": "Test document 3"
}
}
]
}
}
$search->setIndex("test")->match('Test document')->sort('id', 'desc')->sort('_score');
search_request.index = 'test'
search_request.fulltext_filter = manticoresearch.model.QueryFilter('Test document')
sort_by_id = manticoresearch.model.SortOrder('id', 'desc')
search_request.sort = [sort_by_id, '_score']
searchRequest.index = "test";
searchRequest.fulltext_filter = new Manticoresearch.QueryFilter('Test document');
sortById = new Manticoresearch.SortOrder('id', 'desc');
searchRequest.sort = [sortById, '_score'];
searchRequest.setIndex("test");
QueryFilter queryFilter = new QueryFilter();
queryFilter.setQueryString("Test document");
searchRequest.setFulltextFilter(queryFilter);
List<Object> sort = new ArrayList<Object>();
SortOrder sortById = new SortOrder();
sortById.setAttr("id");
sortById.setOrder(SortOrder.OrderEnum.DESC);
sort.add(sortById);
sort.add("_score");
searchRequest.setSort(sort);
var searchRequest = new SearchRequest("test");
searchRequest.FulltextFilter = new QueryFilter("Test document");
searchRequest.Sort = new List<Object>();
var sortById = new SortOrder("id", SortOrder.OrderEnum.Desc);
searchRequest.Sort.Add(sortById);
searchRequest.Sort.Add("_score");
searchRequest = {
index: 'test',
query: {
query_string: 'Test document',
},
sort: [{'id': 'desc'}, '_score'],
}
searchRequest.SetIndex("test")
query := map[string]interface{} {"query_string": "Test document"}
searchRequest.SetQuery(query)
sort := map[string]interface{} {"id": "desc", "_score": "asc"}
searchRequest.SetSort(sort)
You can also use another syntax and specify the sort order via the order property:
{
"index":"test",
"query":
{
"match": { "title": "Test document" }
},
"sort":
[
{ "id": { "order":"desc" } }
],
"_source": "title",
"limit": 3
}
{
"took": 0,
"timed_out": false,
"hits": {
"total": 5,
"total_relation": "eq",
"hits": [
{
"_id": "5406864699109146632",
"_score": 2319,
"_source": {
"title": "Test document 5"
}
},
{
"_id": "5406864699109146631",
"_score": 2319,
"_source": {
"title": "Test document 4"
}
},
{
"_id": "5406864699109146630",
"_score": 2319,
"_source": {
"title": "Test document 3"
}
}
]
}
}
$search->setIndex("test")->match('Test document')->sort('id', 'desc');
search_request.index = 'test'
search_request.fulltext_filter = manticoresearch.model.QueryFilter('Test document')
sort_by_id = manticoresearch.model.SortOrder('id', 'desc')
search_request.sort = [sort_by_id]
searchRequest.index = "test";
searchRequest.fulltext_filter = new Manticoresearch.QueryFilter('Test document');
sortById = new Manticoresearch.SortOrder('id', 'desc');
searchRequest.sort = [sortById];
searchRequest.setIndex("test");
QueryFilter queryFilter = new QueryFilter();
queryFilter.setQueryString("Test document");
searchRequest.setFulltextFilter(queryFilter);
List<Object> sort = new ArrayList<Object>();
SortOrder sortById = new SortOrder();
sortById.setAttr("id");
sortById.setOrder(SortOrder.OrderEnum.DESC);
sort.add(sortById);
searchRequest.setSort(sort);
var searchRequest = new SearchRequest("test");
searchRequest.FulltextFilter = new QueryFilter("Test document");
searchRequest.Sort = new List<Object>();
var sortById = new SortOrder("id", SortOrder.OrderEnum.Desc);
searchRequest.Sort.Add(sortById);
searchRequest = {
index: 'test',
query: {
query_string: 'Test document',
},
sort: [ {'id': {'order': 'desc'}} ],
}
searchRequest.SetIndex("test")
query := map[string]interface{} {"query_string": "Test document"}
searchRequest.SetQuery(query)
sort := map[string]interface{} { "id": map[string]interface{} {"order": "desc"} }
searchRequest.SetSort(sort)
Sorting by MVA attributes is also supported in JSON queries. Sorting mode can be set via the mode property. The following modes are supported:
- min: sort by minimum value
- max: sort by maximum value
{
"index":"test",
"query":
{
"match": { "title": "Test document" }
},
"sort":
[
{ "attr_mva": { "order":"desc", "mode":"max" } }
],
"_source": "title",
"limit": 3
}
{
"took": 0,
"timed_out": false,
"hits": {
"total": 5,
"total_relation": "eq",
"hits": [
{
"_id": "5406864699109146631",
"_score": 2319,
"_source": {
"title": "Test document 4"
}
},
{
"_id": "5406864699109146629",
"_score": 2319,
"_source": {
"title": "Test document 2"
}
},
{
"_id": "5406864699109146628",
"_score": 2319,
"_source": {
"title": "Test document 1"
}
}
]
}
}
$search->setIndex("test")->match('Test document')->sort('id','desc','max');
search_request.index = 'test'
search_request.fulltext_filter = manticoresearch.model.QueryFilter('Test document')
sort = manticoresearch.model.SortMVA('attr_mva', 'desc', 'max')
search_request.sort = [sort]
searchRequest.index = "test";
searchRequest.fulltext_filter = new Manticoresearch.QueryFilter('Test document');
sort = new Manticoresearch.SortMVA('attr_mva', 'desc', 'max');
searchRequest.sort = [sort];
searchRequest.setIndex("test");
QueryFilter queryFilter = new QueryFilter();
queryFilter.setQueryString("Test document");
searchRequest.setFulltextFilter(queryFilter);
SortMVA sort = new SortMVA();
sort.setAttr("attr_mva");
sort.setOrder(SortMVA.OrderEnum.DESC);
sort.setMode(SortMVA.ModeEnum.MAX);
searchRequest.setSort(sort);
var searchRequest = new SearchRequest("test");
searchRequest.FulltextFilter = new QueryFilter("Test document");
var sort = new SortMVA("attr_mva", SortMVA.OrderEnum.Desc, SortMVA.ModeEnum.Max);
searchRequest.Sort.Add(sort);
searchRequest = {
index: 'test',
query: {
query_string: 'Test document',
},
sort: { "attr_mva": { "order":"desc", "mode":"max" } },
}
searchRequest.SetIndex("test")
query := map[string]interface{} {"query_string": "Test document"}
searchRequest.SetQuery(query)
sort := map[string]interface{} { "attr_mva": map[string]interface{} {"order": "desc", "mode": "max"} }
searchRequest.SetSort(sort)
When sorting on an attribute, match weight (score) calculation is disabled by default (no ranker is used). You can enable weight calculation by setting the track_scores property to true:
{
"index":"test",
"track_scores": true,
"query":
{
"match": { "title": "Test document" }
},
"sort":
[
{ "attr_mva": { "order":"desc", "mode":"max" } }
],
"_source": "title",
"limit": 3
}
{
"took": 0,
"timed_out": false,
"hits": {
"total": 5,
"total_relation": "eq",
"hits": [
{
"_id": "5406864699109146631",
"_score": 2319,
"_source": {
"title": "Test document 4"
}
},
{
"_id": "5406864699109146629",
"_score": 2319,
"_source": {
"title": "Test document 2"
}
},
{
"_id": "5406864699109146628",
"_score": 2319,
"_source": {
"title": "Test document 1"
}
}
]
}
}
$search->setIndex("test")->match('Test document')->sort('attr_mva','desc','max')->trackScores(true);
search_request.index = 'test'
search_request.track_scores = true
search_request.fulltext_filter = manticoresearch.model.QueryFilter('Test document')
sort = manticoresearch.model.SortMVA('attr_mva', 'desc', 'max')
search_request.sort = [sort]
searchRequest.index = "test";
searchRequest.trackScores = true;
searchRequest.fulltext_filter = new Manticoresearch.QueryFilter('Test document');
sort = new Manticoresearch.SortMVA('attr_mva', 'desc', 'max');
searchRequest.sort = [sort];
searchRequest.setIndex("test");
searchRequest.setTrackScores(true);
QueryFilter queryFilter = new QueryFilter();
queryFilter.setQueryString("Test document");
searchRequest.setFulltextFilter(queryFilter);
SortMVA sort = new SortMVA();
sort.setAttr("attr_mva");
sort.setOrder(SortMVA.OrderEnum.DESC);
sort.setMode(SortMVA.ModeEnum.MAX);
searchRequest.setSort(sort);
var searchRequest = new SearchRequest("test");
searchRequest.SetTrackScores(true);
searchRequest.FulltextFilter = new QueryFilter("Test document");
var sort = new SortMVA("attr_mva", SortMVA.OrderEnum.Desc, SortMVA.ModeEnum.Max);
searchRequest.Sort.Add(sort);
searchRequest = {
index: 'test',
track_scores: true,
query: {
query_string: 'Test document',
},
sort: { "attr_mva": { "order":"desc", "mode":"max" } },
}
searchRequest.SetIndex("test")
searchRequest.SetTrackScores(true)
query := map[string]interface{} {"query_string": "Test document"}
searchRequest.SetQuery(query)
sort := map[string]interface{} { "attr_mva": map[string]interface{} {"order": "desc", "mode": "max"} }
searchRequest.SetSort(sort)
Ranking (also known as weighting) of search results can be defined as the process of computing a so-called relevance (weight) for every matched document with regard to the query that matched it. So relevance is, in the end, just a number attached to every document that estimates how relevant the document is to the query. Search results can then be sorted based on this number and/or some additional parameters, so that the most sought-after results appear higher on the results page.
There is no single standard one-size-fits-all way to rank any document in any scenario. Moreover, there can never be such a way, because relevance is subjective. As in, what seems relevant to you might not seem relevant to me. Hence, in general cases, it's not just hard to compute; it's theoretically impossible.
So ranking in Manticore is configurable. It has a notion of a so-called ranker. A ranker can formally be defined as a function that takes a document and a query as its input and produces a relevance value as output. In layman's terms, a ranker controls exactly how (using which specific algorithm) Manticore will assign weights to the documents.
Manticore ships with several built-in rankers suited for different purposes. Many of them use two factors: phrase proximity (also known as LCS) and BM25. Phrase proximity works on keyword positions, while BM25 works on keyword frequencies. Essentially, the better the degree of phrase match between the document body and the query, the higher the phrase proximity (it maxes out when the document contains the entire query as a verbatim quote). And BM25 is higher when the document contains more rare words. We'll save the detailed discussion for later.
The currently implemented rankers are:
- proximity_bm25, the default ranking mode that uses and combines both phrase proximity and BM25 ranking.
- bm25, a statistical ranking mode that uses BM25 ranking only (similar to most other full-text engines). This mode is faster but may result in worse quality for queries containing more than one keyword.
- none, a no-ranking mode. This mode is obviously the fastest. A weight of 1 is assigned to all matches. This is sometimes called boolean searching, which just matches the documents but does not rank them.
- wordcount, ranking by the keyword occurrences count. This ranker computes the per-field keyword occurrence counts, then multiplies them by field weights, and sums the resulting values.
- proximity returns the raw phrase proximity value as a result. This mode is internally used to emulate SPH_MATCH_ALL queries.
- matchany returns rank as it was computed in SPH_MATCH_ANY mode earlier and is internally used to emulate SPH_MATCH_ANY queries.
- fieldmask returns a 32-bit mask with the N-th bit corresponding to the N-th full-text field, numbering from 0. The bit will only be set when the respective field has any keyword occurrences satisfying the query.
- sph04 is generally based on the default 'proximity_bm25' ranker, but additionally boosts matches when they occur at the very beginning or the very end of a text field. Thus, if a field equals the exact query, sph04 should rank it higher than a field that contains the exact query but is not equal to it. (For instance, when the query is "Hyde Park", a document titled "Hyde Park" should be ranked higher than one titled "Hyde Park, London" or "The Hyde Park Cafe".)
- expr allows you to specify the ranking formula at runtime. It exposes several internal text factors and lets you define how the final weight should be computed from those factors. You can find more details about its syntax and a reference of available factors in a subsection below.
The ranker name is case-insensitive. Example:
SELECT ... OPTION ranker=sph04;
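For the expr ranker, the formula is passed as a string argument. Here's a minimal sketch, assuming a full-text table named mytable; the formula reproduces the default proximity_bm25 ranker (see the emulation formulas further below):
SELECT id, weight() FROM mytable WHERE MATCH('hello world') OPTION ranker=expr('sum(lcs*user_weight)*1000+bm25');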
| Name | Level | Type | Summary |
|---|---|---|---|
| max_lcs | query | int | maximum possible LCS value for the current query |
| bm25 | document | int | quick estimate of BM25(1.2, 0) |
| bm25a(k1, b) | document | int | precise BM25() value with configurable K1, B constants and syntax support |
| bm25f(k1, b, {field=weight, ...}) | document | int | precise BM25F() value with extra configurable field weights |
| field_mask | document | int | bit mask of matched fields |
| query_word_count | document | int | number of unique inclusive keywords in a query |
| doc_word_count | document | int | number of unique keywords matched in the document |
| lcs | field | int | Longest Common Subsequence between query and document, in words |
| user_weight | field | int | user field weight |
| hit_count | field | int | total number of keyword occurrences |
| word_count | field | int | number of unique matched keywords |
| tf_idf | field | float | sum(tf*idf) over matched keywords == sum(idf) over occurrences |
| min_hit_pos | field | int | first matched occurrence position, in words, 1-based |
| min_best_span_pos | field | int | first maximum LCS span position, in words, 1-based |
| exact_hit | field | bool | whether query == field |
| min_idf | field | float | min(idf) over matched keywords |
| max_idf | field | float | max(idf) over matched keywords |
| sum_idf | field | float | sum(idf) over matched keywords |
| exact_order | field | bool | whether all query keywords were a) matched and b) in query order |
| min_gaps | field | int | minimum number of gaps between the matched keywords over the matching spans |
| lccs | field | int | Longest Common Contiguous Subsequence between query and document, in words |
| wlccs | field | float | Weighted Longest Common Contiguous Subsequence, sum(idf) over contiguous keyword spans |
| atc | field | float | Aggregate Term Closeness, log(1+sum(idf1*idf2*pow(distance, -1.75))) over the best pairs of keywords |
A document-level factor is a numeric value computed by the ranking engine for every matched document with regard to the current query. It differs from a plain document attribute in that an attribute does not depend on the full-text query, while factors might. These factors can be used anywhere in the ranking expression. The currently implemented document-level factors are:
- bm25 (integer), a document-level BM25 estimate (computed without keyword occurrence filtering).
- max_lcs (integer), a query-level maximum possible value that the sum(lcs*user_weight) expression can ever take. This can be useful for weight boost scaling. For instance, the MATCHANY ranker formula uses this to guarantee that a full phrase match in any field ranks higher than any combination of partial matches in all fields.
- field_mask (integer), a document-level 32-bit mask of matched fields.
- query_word_count (integer), the number of unique keywords in a query, adjusted for the number of excluded keywords. For instance, both (one one one one) and (one !two) queries should assign a value of 1 to this factor, because there is just one unique non-excluded keyword.
- doc_word_count (integer), the number of unique keywords matched in the entire document.

A field-level factor is a numeric value computed by the ranking engine for every matched in-document text field with regard to the current query. As more than one field can be matched by a query, but the final weight needs to be a single integer value, these per-field values need to be folded into a single one. To achieve that, field-level factors can only be used within a field aggregation function; they cannot be used anywhere else in the expression. For example, you cannot use (lcs+bm25) as your ranking expression, as lcs takes multiple values (one in every matched field). You should use (sum(lcs)+bm25) instead; that expression sums lcs over all matching fields, and then adds bm25 to that per-field sum. The currently implemented field-level factors are:
- lcs (integer), the length of a maximum verbatim match between the document and the query, counted in words. LCS stands for Longest Common Subsequence (or Subset). It takes a minimum value of 1 when only stray keywords were matched in a field, and a maximum value of the query keywords count when the entire query was matched in a field verbatim (in the exact query keyword order). For example, if the query is 'hello world' and the field contains these two words quoted from the query (that is, adjacent to each other, and exactly in the query order), lcs will be 2. For example, if the query is 'hello world program' and the field contains 'hello world', lcs will be 2. Note that any subset of the query keywords works, not just a subset of adjacent keywords. For example, if the query is 'hello world program' and the field contains 'hello (test program)', lcs will be 2 just as well, because both 'hello' and 'program' matched in the same respective positions as they were in the query. Finally, if the query is 'hello world program' and the field contains 'hello world program', lcs will be 3. (Hopefully that is unsurprising at this point.)
- user_weight (integer), the user-specified per-field weight (refer to OPTION field_weights in SQL). The weights default to 1 if not specified explicitly.
- hit_count (integer), the number of keyword occurrences that matched in the field. Note that a single keyword may occur multiple times. For example, if 'hello' occurs 3 times in a field and 'world' occurs 5 times, hit_count will be 8.
- word_count (integer), the number of unique keywords matched in the field. For example, if 'hello' and 'world' occur anywhere in a field, word_count will be 2, regardless of how many times both keywords occur.
- tf_idf (float), the sum of TF*IDF over all the keywords matched in the field. IDF is the Inverse Document Frequency, a floating point value between 0 and 1 that describes how frequent the keyword is (basically, 0 for a keyword that occurs in every indexed document, and 1 for a unique keyword that occurs in just a single document). TF is the Term Frequency, the number of matched keyword occurrences in the field. As a side note, tf_idf is actually computed by summing IDF over all matched occurrences. That's by construction equivalent to summing TF*IDF over all matched keywords.
- min_hit_pos (integer), the position of the first matched keyword occurrence, counted in words, 1-based.
- min_gaps (integer), the minimum number of gaps between the matched keywords over the matching spans (see the factor table above).

Therefore, min_gaps is a relatively low-level, "raw" factor that you'll likely want to adjust before using it for ranking. The specific adjustments depend heavily on your data and the resulting formula, but here are a few ideas to start with (a sketch combining some of them follows below): (a) any min_gaps-based boosts could be simply ignored when word_count<2;
(b) non-trivial min_gaps values (i.e., when word_count>=2) could be clamped with a certain "worst-case" constant, while trivial values (i.e., when min_gaps=0 and word_count<2) could be replaced by that constant;
(c) a transfer function like 1/(1+min_gaps) could be applied (so that better, smaller min_gaps values would maximize it, and worse, larger min_gaps values would fall off slowly); and so on.
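As a hedged sketch of ideas (a) and (c) combined (the table mytable and the query text are just placeholders), the min_gaps-based boost is applied only when word_count>=2 and decays as min_gaps grows:
SELECT id, weight() FROM mytable WHERE MATCH('hello world program')
OPTION ranker=expr('sum((lcs + IF(word_count>=2, 1.0/(1+min_gaps), 0)) * user_weight)*1000 + bm25');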
lccs (integer). Longest Common Contiguous Subsequence. The length of the longest subphrase common between the query and the document, computed in keywords.
The LCCS factor is somewhat similar to LCS but more restrictive. While LCS can be greater than 1 even if no two query words are matched next to each other, LCCS will only be greater than 1 if there are exact, contiguous query subphrases in the document. For example, (one two three four five) query vs (one hundred three hundred five hundred) document would yield lcs=3, but lccs=1, because although the mutual dispositions of 3 keywords (one, three, five) match between the query and the document, no 2 matching positions are actually adjacent.
Note that LCCS still doesn't differentiate between frequent and rare keywords; for that, see WLCCS.
wlccs (float). Weighted Longest Common Contiguous Subsequence. The sum of IDFs of the keywords of the longest subphrase common between the query and the document.
WLCCS is calculated similarly to LCCS, but every "suitable" keyword occurrence increases it by the keyword IDF instead of just by 1 (as with LCS and LCCS). This allows ranking sequences of rarer and more important keywords higher than sequences of frequent keywords, even if the latter are longer. For example, a query (Zanzibar bed and breakfast) would yield lccs=1 for a (hotels of Zanzibar) document, but lccs=3 against (London bed and breakfast), even though "Zanzibar" is actually somewhat rarer than the entire "bed and breakfast" phrase. The WLCCS factor addresses this issue by using keyword frequencies.
atc (float). Aggregate Term Closeness. A proximity-based measure that increases when the document contains more groups of more closely located and more important (rare) query keywords.
WARNING: you should use ATC with OPTION idf='plain,tfidf_unnormalized' (see below); otherwise, you may get unexpected results.
ATC essentially operates as follows. For each keyword occurrence in the document, we compute the so-called term closeness. To do this, we examine all the other closest occurrences of all the query keywords (including the keyword itself) to the left and right of the subject occurrence, calculate a distance dampening coefficient as k = pow(distance, -1.75) for these occurrences, and sum the dampened IDFs. As a result, for every occurrence of each keyword, we obtain a "closeness" value that describes the "neighbors" of that occurrence. We then multiply these per-occurrence closenesses by their respective subject keyword IDF, sum them all, and finally compute a logarithm of that sum.
In other words, we process the best (closest) matched keyword pairs in the document, and compute pairwise "closenesses" as the product of their IDFs scaled by the distance coefficient:
pair_tc = idf(pair_word1) * idf(pair_word2) * pow(pair_distance, -1.75)
We then sum such closenesses, and compute the final, log-dampened ATC value:
atc = log(1+sum(pair_tc))
Note that this final dampening logarithm is precisely the reason you should use OPTION idf=plain because, without it, the expression inside the log() could be negative.
Having closer keyword occurrences contributes much more to ATC than having more frequent keywords. Indeed, when the keywords are right next to each other, distance=1 and k=1; when there's just one word in between them, distance=2 and k=0.297, with two words between, distance=3 and k=0.146, and so on. At the same time, IDF attenuates somewhat slower. For example, in a 1 million document collection, the IDF values for keywords that match in 10, 100, and 1000 documents would be respectively 0.833, 0.667, and 0.500. So a keyword pair with two rather rare keywords that occur in just 10 documents each but with 2 other words in between would yield pair_tc = 0.101 and thus barely outweigh a pair with a 100-doc and a 1000-doc keyword with 1 other word between them and pair_tc = 0.099. Moreover, a pair of two unique, 1-doc keywords with 3 words between them would get a pair_tc = 0.088 and lose to a pair of two 1000-doc keywords located right next to each other and yielding a pair_tc = 0.25. So, basically, while ATC does combine both keyword frequency and proximity, it still somewhat favors proximity.
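For illustration, here's a minimal sketch of using atc inside an expression ranker together with the recommended IDF options (the table mytable and the query text are hypothetical):
SELECT id, weight() FROM mytable WHERE MATCH('zanzibar bed and breakfast')
OPTION ranker=expr('sum(atc)*1000 + bm25'), idf='plain,tfidf_unnormalized';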
A field aggregation function is a single-argument function that accepts an expression with field-level factors, iterates over all matched fields, and computes the final results. The currently implemented field aggregation functions include:
- sum, which adds the argument expression over all matched fields. For example, sum(1) should return the number of matched fields.
- top, which returns the highest value of the argument across all matched fields.

Most other rankers can actually be emulated using the expression-based ranker. You just need to provide an appropriate expression. While this emulation will likely be slower than using the built-in, compiled ranker, it may still be interesting if you want to fine-tune your ranking formula starting with one of the existing ones. Additionally, the formulas describe the ranker details in a clear, readable manner.
- proximity_bm25: sum(lcs*user_weight)*1000+bm25
- bm25: sum(user_weight)*1000+bm25
- none: 1
- wordcount: sum(hit_count*user_weight)
- proximity: sum(lcs*user_weight)
- matchany: sum((word_count+(lcs-1)*max_lcs)*user_weight)
- fieldmask: field_mask
- sph04: sum((4*lcs+2*(min_hit_pos==1)+exact_hit)*user_weight)*1000+bm25

The historically default IDF (Inverse Document Frequency) in Manticore is equivalent to OPTION idf='normalized,tfidf_normalized', and those normalizations may cause several undesired effects.
First, idf=normalized causes keyword penalization. For instance, if you search for 'the | something' and 'the' occurs in more than 50% of the documents, then documents with both keywords 'the' and 'something' will get less weight than documents with just the one keyword 'something'. Using OPTION idf=plain avoids this.
Plain IDF varies in [0, log(N)] range, and keywords are never penalized; while the normalized IDF varies in [-log(N), log(N)] range, and too frequent keywords are penalized.
Second, idf=tfidf_normalized causes IDF drift over queries. Historically, we additionally divided IDF by the query keyword count, so that the entire sum(tf*idf) over all keywords would still fit into the [0,1] range. However, that means that the queries word1 and word1 | nonmatchingword2 would assign different weights to exactly the same result set, because the IDFs for both word1 and nonmatchingword2 would be divided by 2. OPTION idf='tfidf_unnormalized' fixes that. Note that the BM25, BM25A(), and BM25F() ranking factors will be scaled accordingly once you disable this normalization.
IDF flags can be mixed; plain and normalized are mutually exclusive; tfidf_unnormalized and tfidf_normalized are mutually exclusive; and unspecified flags in such a mutually exclusive group take their defaults. That means that OPTION idf=plain is equivalent to the complete OPTION idf='plain,tfidf_normalized' specification.
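For example, a minimal sketch of disabling both normalizations for a query where a very frequent keyword would otherwise be penalized (the table mytable is hypothetical):
SELECT id, weight() FROM mytable WHERE MATCH('the | something') OPTION idf='plain,tfidf_unnormalized';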
Manticore Search returns the top 20 matched documents in the result set by default.
In SQL, you can navigate through the result set using the LIMIT clause.
LIMIT can accept either one number as the size of the returned set with a zero offset, or a pair of offset and size values.
When using HTTP JSON, the nodes offset and limit control the offset of the result set and the size of the returned set. Alternatively, you can use the pair size and from instead.
SELECT ... FROM ... [LIMIT [offset,] row_count]
SELECT ... FROM ... [LIMIT row_count][ OFFSET offset]
{
"index": "<index_name>",
"query": ...
...
"limit": 20,
"offset": 0
}
{
"index": "<index_name>",
"query": ...
...
"size": 20,
"from": 0
}
By default, Manticore Search uses a window of the 1000 best-ranked documents that can be returned in the result set. If the result set is paginated beyond this value, the query will end with an error.
This limitation can be adjusted with the query option max_matches.
Increase max_matches to very high values only when it's necessary to navigate that deep into the result set. A high max_matches value requires more memory and can increase the query response time. One way to work with deep result sets is to set max_matches to the sum of the offset and limit.
Lowering max_matches below 1000 has the benefit of reducing the memory used by the query. It can also reduce the query time, but in most cases, it might not be a noticeable gain.
SELECT ... FROM ... OPTION max_matches=<value>
{
"index": "<index_name>",
"query": ...
...
"max_matches":<value>
}
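For example, to fetch results 3001-3020 while keeping max_matches as small as possible, you could set it to the sum of the offset and limit, as suggested above. A minimal SQL sketch (the table and query are hypothetical):
SELECT id FROM mytable WHERE MATCH('keyword') LIMIT 3000,20 OPTION max_matches=3020;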
Manticore is designed to scale effectively through its distributed searching capabilities. Distributed searching is beneficial for improving query latency (i.e., search time) and throughput (i.e., max queries/sec) in multi-server, multi-CPU, or multi-core environments. This is crucial for applications that need to search through vast amounts of data (i.e., billions of records and terabytes of text).
The primary concept is to horizontally partition the searched data across search nodes and process it in parallel.
Partitioning is done manually: you spread different parts of your dataset across several searchd instances and configure a special distributed table that references them.
This type of table only contains references to other local and remote tables, so it cannot be directly reindexed. Instead, you should reindex the tables that it references.
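For illustration, here's a minimal sketch of defining a distributed table via SQL; the table names, host, and port below are hypothetical:
CREATE TABLE products_dist type='distributed' local='products_shard1' agent='10.0.0.2:9312:products_shard2';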
When Manticore receives a query against a distributed table, it forwards the query to all of the local and remote tables the distributed table references, collects their results, and merges them into a single result set.
From the application's perspective, there are no differences between searching through a regular table or a distributed table. In other words, distributed tables are fully transparent to the application, and there's no way to tell whether the table you queried was distributed or local.
Learn more about remote nodes.
Multi-queries, or query batches, allow you to send multiple search queries to Manticore in a single network request.
👍 Why use multi-queries?
The primary reason is performance. By sending requests to Manticore in a batch instead of one by one, you save time by reducing network round-trips. Additionally, sending queries in a batch allows Manticore to perform certain internal optimizations. If no batch optimizations can be applied, queries will be processed individually.
⛔ When not to use multi-queries?
Multi-queries require all search queries in a batch to be independent, which isn't always the case. Sometimes query B depends on query A's results, meaning query B can only be set up after executing query A. For example, you might want to display results from a secondary index only if no results were found in the primary table, or you may want to specify an offset into the 2nd result set based on the number of matches in the 1st result set. In these cases, you'll need to use separate queries (or separate batches).
You can run multiple search queries with SQL by separating them with a semicolon. When Manticore receives a query formatted like this from a client, all inter-statement optimizations will be applied.
Multi-queries don't support queries with FACET. The number of multi-queries in one batch shouldn't exceed max_batch_queries.
SELECT id, price FROM products WHERE MATCH('remove hair') ORDER BY price DESC; SELECT id, price FROM products WHERE MATCH('remove hair') ORDER BY price ASC
There are two major optimizations to be aware of: common query optimization and common subtree optimization.
Common query optimization means that searchd will identify all those queries in a batch where only the sorting and group-by settings differ, and only perform searching once. For example, if a batch consists of 3 queries, all of them are for "ipod nano", but the 1st query requests the top-10 results sorted by price, the 2nd query groups by vendor ID and requests the top-5 vendors sorted by rating, and the 3rd query requests the max price, full-text search for "ipod nano" will only be performed once, and its results will be reused to build 3 different result sets.
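For illustration, that 3-query scenario could look like the following SQL batch, sent as a single semicolon-separated request (the table and column names are hypothetical; the last query stands in for the max-price request by fetching the top-priced match). Since the full-text query and filters are identical, common query optimization can kick in:
SELECT id, price FROM products WHERE MATCH('ipod nano') ORDER BY price DESC LIMIT 10;
SELECT vendor_id, AVG(rating) avg_rating FROM products WHERE MATCH('ipod nano') GROUP BY vendor_id ORDER BY avg_rating DESC LIMIT 5;
SELECT id, price FROM products WHERE MATCH('ipod nano') ORDER BY price DESC LIMIT 1;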
Faceted search is a particularly important case that benefits from this optimization. Indeed, faceted searching can be implemented by running several queries, one to retrieve search results themselves, and a few others with the same full-text query but different group-by settings to retrieve all the required groups of results (top-3 authors, top-5 vendors, etc). As long as the full-text query and filtering settings stay the same, common query optimization will trigger, and greatly improve performance.
Common subtree optimization is even more interesting. It allows searchd to exploit similarities between batched full-text queries. It identifies common full-text query parts (subtrees) in all queries and caches them between queries. For example, consider the following query batch:
donald trump president
donald trump barack obama john mccain
donald trump speech
There's a common two-word part donald trump that can be computed only once, then cached and shared across the queries. And common subtree optimization does just that. Per-query cache size is strictly controlled by subtree_docs_cache and subtree_hits_cache directives (so that caching all sixteen gazillions of documents that match "i am" does not exhaust the RAM and instantly kill your server).
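The cache sizes are set in the searchd section of the configuration file; a minimal sketch (the values below are only examples, not recommendations):
searchd {
    # ... other searchd settings ...
    subtree_docs_cache = 8M
    subtree_hits_cache = 16M
}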
How can you tell if the queries in the batch were actually optimized? If they were, the respective query log will have a "multiplier" field that specifies how many queries were processed together:
Note the "x3" field. It means that this query was optimized and processed in a sub-batch of 3 queries.
[Sun Jul 12 15:18:17.000 2009] 0.040 sec x3 [ext/0/rel 747541 (0,20)] [lj] the
[Sun Jul 12 15:18:17.000 2009] 0.040 sec x3 [ext/0/ext 747541 (0,20)] [lj] the
[Sun Jul 12 15:18:17.000 2009] 0.040 sec x3 [ext/0/ext 747541 (0,20)] [lj] the
For reference, this is how the regular log would look if the queries were not batched:
[Sun Jul 12 15:18:17.062 2009] 0.059 sec [ext/0/rel 747541 (0,20)] [lj] the
[Sun Jul 12 15:18:17.156 2009] 0.091 sec [ext/0/ext 747541 (0,20)] [lj] the
[Sun Jul 12 15:18:17.250 2009] 0.092 sec [ext/0/ext 747541 (0,20)] [lj] the
Notice how the per-query time in the multi-query case improved by a factor of 1.5x to 2.3x, depending on the specific sorting mode.
Manticore supports SELECT subqueries via SQL in the following format:
SELECT * FROM (SELECT ... ORDER BY cond1 LIMIT X) ORDER BY cond2 LIMIT Y
The outer select allows only the ORDER BY and LIMIT clauses. Sub-select queries currently have two main use cases, both illustrated below: applying an expensive ranking expression only to a reduced set of the best matches, and reducing the amount of data sent from the agents of a distributed table to the master node.
Consider a query with two ranking UDFs, one very fast and one slow:
SELECT id,slow_rank() as slow,fast_rank() as fast FROM index
WHERE MATCH('some common query terms') ORDER BY fast DESC, slow DESC LIMIT 20
OPTION max_matches=1000;
With sub-selects, the query can be rewritten as:
SELECT * FROM
(SELECT id,slow_rank() as slow,fast_rank() as fast FROM index WHERE
MATCH('some common query terms')
ORDER BY fast DESC LIMIT 100 OPTION max_matches=1000)
ORDER BY slow DESC LIMIT 20;
In the initial query, the slow_rank() UDF is computed for the entire match result set. With SELECT sub-queries, only fast_rank() is computed for the entire match result set, while slow_rank() is computed for a limited set.
For this query:
SELECT * FROM my_dist_index WHERE some_conditions LIMIT 50000;
If you have 20 nodes, each node can send back to the master a maximum of 50K records, resulting in 20 x 50K = 1M records. However, since the master sends back only 50K (out of 1M), it might be good enough for the nodes to send only the top 10K records. With sub-select, you can rewrite the query as:
SELECT * FROM
(SELECT * FROM my_dist_index WHERE some_conditions LIMIT 10000)
ORDER by some_attr LIMIT 50000;
In this case, the nodes receive only the inner query and execute it. This means the master will receive only 20x10K=200K records. The master will take all the records received, reorder them by the OUTER clause, and return the best 50K records. The sub-select helps reduce the traffic between the master and the nodes, as well as reduce the master's computation time (since it processes only 200K instead of 1M records).
Grouping search results is often helpful for obtaining per-group match counts or other aggregations. For example, it's useful for creating a graph illustrating the number of matching blog posts per month or grouping web search results by site or forum posts by author, etc.
Manticore supports grouping of search results by single or multiple columns and computed expressions. The results can be sorted inside each group, include more than one row per group, have groups filtered and sorted, and be aggregated with the help of aggregation functions.
The general syntax is:
SELECT {* | SELECT_expr [, SELECT_expr ...]}
...
GROUP BY {field_name | alias } [, ...]
[HAVING where_condition]
[WITHIN GROUP ORDER BY field_name {ASC | DESC} [, ...]]
...
SELECT_expr: { field_name | function_name(...) }
where_condition: {aggregation expression alias | COUNT(*)}
{
"index": "<index_name>",
"limit": 0,
"aggs": {
"<aggr_name>": {
"terms": {
"field": "<attribute>",
"size": <int value>
}
}
}
}
Grouping is quite simple - just add "GROUP BY smth" to the end of your SELECT query. The something can be:
- any non-full-text attribute of the table
- an alias from the SELECT list: if you have an expression in the SELECT list, you can GROUP BY it too

You can omit any aggregation functions in the SELECT list and it will still work:
SELECT release_year FROM films GROUP BY release_year LIMIT 5;
+--------------+
| release_year |
+--------------+
| 2004 |
| 2002 |
| 2001 |
| 2005 |
| 2000 |
+--------------+
In most cases, however, you'll want to obtain some aggregated data for each group, such as:
- COUNT(*) to simply get the number of elements in each group
- AVG(field) to calculate the average value of the field within the group
SELECT release_year, count(*) FROM films GROUP BY release_year LIMIT 5;
+--------------+----------+
| release_year | count(*) |
+--------------+----------+
| 2004 | 108 |
| 2002 | 108 |
| 2001 | 91 |
| 2005 | 93 |
| 2000 | 97 |
+--------------+----------+
SELECT release_year, AVG(rental_rate) FROM films GROUP BY release_year LIMIT 5;
+--------------+------------------+
| release_year | avg(rental_rate) |
+--------------+------------------+
| 2004 | 2.78629661 |
| 2002 | 3.08259249 |
| 2001 | 3.09989142 |
| 2005 | 2.90397978 |
| 2000 | 3.17556739 |
+--------------+------------------+
POST /search -d '
{
"index" : "films",
"limit": 0,
"aggs" :
{
"release_year" :
{
"terms" :
{
"field":"release_year",
"size":100
}
}
}
}
'
{
"took": 2,
"timed_out": false,
"hits": {
"total": 10000,
"hits": [
]
},
"release_year": {
"group_brand_id": {
"buckets": [
{
"key": 2004,
"doc_count": 108
},
{
"key": 2002,
"doc_count": 108
},
{
"key": 2000,
"doc_count": 97
},
{
"key": 2005,
"doc_count": 93
},
{
"key": 2001,
"doc_count": 91
}
]
}
}
}
$index->setName('films');
$search = $index->search('');
$search->limit(0);
$search->facet('release_year','release_year',100);
$results = $search->get();
print_r($results->getFacets());
Array
(
[release_year] => Array
(
[buckets] => Array
(
[0] => Array
(
[key] => 2009
[doc_count] => 99
)
[1] => Array
(
[key] => 2008
[doc_count] => 102
)
[2] => Array
(
[key] => 2007
[doc_count] => 93
)
[3] => Array
(
[key] => 2006
[doc_count] => 103
)
[4] => Array
(
[key] => 2005
[doc_count] => 93
)
[5] => Array
(
[key] => 2004
[doc_count] => 108
)
[6] => Array
(
[key] => 2003
[doc_count] => 106
)
[7] => Array
(
[key] => 2002
[doc_count] => 108
)
[8] => Array
(
[key] => 2001
[doc_count] => 91
)
[9] => Array
(
[key] => 2000
[doc_count] => 97
)
)
)
)
res = searchApi.search({"index":"films","limit":0,"aggs":{"release_year":{"terms":{"field":"release_year","size":100}}}})
{'aggregations': {u'release_year': {u'buckets': [{u'doc_count': 99,
u'key': 2009},
{u'doc_count': 102,
u'key': 2008},
{u'doc_count': 93,
u'key': 2007},
{u'doc_count': 103,
u'key': 2006},
{u'doc_count': 93,
u'key': 2005},
{u'doc_count': 108,
u'key': 2004},
{u'doc_count': 106,
u'key': 2003},
{u'doc_count': 108,
u'key': 2002},
{u'doc_count': 91,
u'key': 2001},
{u'doc_count': 97,
u'key': 2000}]}},
'hits': {'hits': [], 'max_score': None, 'total': 1000},
'profile': None,
'timed_out': False,
'took': 0}
res = await searchApi.search({"index":"films","limit":0,"aggs":{"release_year":{"terms":{"field":"release_year","size":100}}}});
{"took":0,"timed_out":false,"aggregations":{"release_year":{"buckets":[{"key":2009,"doc_count":99},{"key":2008,"doc_count":102},{"key":2007,"doc_count":93},{"key":2006,"doc_count":103},{"key":2005,"doc_count":93},{"key":2004,"doc_count":108},{"key":2003,"doc_count":106},{"key":2002,"doc_count":108},{"key":2001,"doc_count":91},{"key":2000,"doc_count":97}]}},"hits":{"total":1000,"hits":[]}}
HashMap<String,Object> aggs = new HashMap<String,Object>(){{
put("release_year", new HashMap<String,Object>(){{
put("terms", new HashMap<String,Object>(){{
put("field","release_year");
put("size",100);
}});
}});
}};
searchRequest = new SearchRequest();
searchRequest.setIndex("films");
searchRequest.setLimit(0);
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
searchRequest.setAggs(aggs);
searchResponse = searchApi.search(searchRequest);
class SearchResponse {
took: 0
timedOut: false
aggregations: {release_year={buckets=[{key=2009, doc_count=99}, {key=2008, doc_count=102}, {key=2007, doc_count=93}, {key=2006, doc_count=103}, {key=2005, doc_count=93}, {key=2004, doc_count=108}, {key=2003, doc_count=106}, {key=2002, doc_count=108}, {key=2001, doc_count=91}, {key=2000, doc_count=97}]}}
hits: class SearchResponseHits {
maxScore: null
total: 1000
hits: []
}
profile: null
}
var agg = new Aggregation("release_year", "release_year");
agg.Size = 100;
object query = new { match_all = (object)null };
var searchRequest = new SearchRequest("films", query);
searchRequest.Aggs = new List<Aggregation> {agg};
var searchResponse = searchApi.Search(searchRequest);
class SearchResponse {
took: 0
timedOut: false
aggregations: {release_year={buckets=[{key=2009, doc_count=99}, {key=2008, doc_count=102}, {key=2007, doc_count=93}, {key=2006, doc_count=103}, {key=2005, doc_count=93}, {key=2004, doc_count=108}, {key=2003, doc_count=106}, {key=2002, doc_count=108}, {key=2001, doc_count=91}, {key=2000, doc_count=97}]}}
hits: class SearchResponseHits {
maxScore: null
total: 1000
hits: []
}
profile: null
}
res = await searchApi.search({
index: 'test',
limit: 0,
aggs: {
cat_id: {
terms: { field: "cat", size: 1 }
}
}
});
{
"took":0,
"timed_out":false,
"aggregations":
{
"cat_id":
{
"buckets":
[{
"key":1,
"doc_count":1
}]
}
},
"hits":
{
"total":5,
"hits":[]
}
}
query := map[string]interface{} {};
searchRequest.SetQuery(query);
aggTerms := manticoreclient.NewAggregationTerms()
aggTerms.SetField("cat")
aggTerms.SetSize(1)
aggregation := manticoreclient.NewAggregation()
aggregation.SetTerms(aggTerms)
searchRequest.SetAggregation(aggregation)
res, _, _ := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()
{
"took":0,
"timed_out":false,
"aggregations":
{
"cat_id":
{
"buckets":
[{
"key":1,
"doc_count":1
}]
}
},
"hits":
{
"total":5,
"hits":[]
}
}
By default, groups are not sorted, and the next thing you typically want to do is order them by something, like the field you're grouping by:
SELECT release_year, count(*) from films GROUP BY release_year ORDER BY release_year asc limit 5;
+--------------+----------+
| release_year | count(*) |
+--------------+----------+
| 2000 | 97 |
| 2001 | 91 |
| 2002 | 108 |
| 2003 | 106 |
| 2004 | 108 |
+--------------+----------+
Alternatively, you can sort by the aggregation:
- count(*) to display groups with the most elements first
- avg(rental_rate) to show the highest-rated movies first. Note that in the example, it's done via an alias: avg(rental_rate) is first mapped to avg in the SELECT list, and then we simply do ORDER BY avg
SELECT release_year, count(*) FROM films GROUP BY release_year ORDER BY count(*) desc LIMIT 5;
+--------------+----------+
| release_year | count(*) |
+--------------+----------+
| 2004 | 108 |
| 2002 | 108 |
| 2003 | 106 |
| 2006 | 103 |
| 2008 | 102 |
+--------------+----------+
SELECT release_year, AVG(rental_rate) avg FROM films GROUP BY release_year ORDER BY avg desc LIMIT 5;
+--------------+------------+
| release_year | avg |
+--------------+------------+
| 2006 | 3.26184368 |
| 2000 | 3.17556739 |
| 2001 | 3.09989142 |
| 2002 | 3.08259249 |
| 2008 | 2.99000049 |
+--------------+------------+
In some cases, you might want to group not just by a single field, but by multiple fields at once, such as a movie's category and year:
SELECT category_id, release_year, count(*) FROM films GROUP BY category_id, release_year ORDER BY category_id ASC, release_year ASC;
+-------------+--------------+----------+
| category_id | release_year | count(*) |
+-------------+--------------+----------+
| 1 | 2000 | 5 |
| 1 | 2001 | 2 |
| 1 | 2002 | 6 |
| 1 | 2003 | 6 |
| 1 | 2004 | 5 |
| 1 | 2005 | 10 |
| 1 | 2006 | 4 |
| 1 | 2007 | 5 |
| 1 | 2008 | 7 |
| 1 | 2009 | 14 |
| 2 | 2000 | 10 |
| 2 | 2001 | 5 |
| 2 | 2002 | 6 |
| 2 | 2003 | 6 |
| 2 | 2004 | 10 |
| 2 | 2005 | 4 |
| 2 | 2006 | 5 |
| 2 | 2007 | 8 |
| 2 | 2008 | 8 |
| 2 | 2009 | 4 |
+-------------+--------------+----------+
POST /search -d '
{
"size": 0,
"index": "films",
"aggs": {
"cat_release": {
"composite": {
"size":5,
"sources": [
{ "category": { "terms": { "field": "category_id" } } },
{ "release year": { "terms": { "field": "release_year" } } }
]
}
}
}
}
'
{
"took": 0,
"timed_out": false,
"hits": {
"total": 1000,
"total_relation": "eq",
"hits": []
},
"aggregations": {
"cat_release": {
"after_key": {
"category": 1,
"release year": 2007
},
"buckets": [
{
"key": {
"category": 1,
"release year": 2008
},
"doc_count": 7
},
{
"key": {
"category": 1,
"release year": 2009
},
"doc_count": 14
},
{
"key": {
"category": 1,
"release year": 2005
},
"doc_count": 10
},
{
"key": {
"category": 1,
"release year": 2004
},
"doc_count": 5
},
{
"key": {
"category": 1,
"release year": 2007
},
"doc_count": 5
}
]
}
}
}
Sometimes it's useful to see not just a single element per group, but multiple. This can be easily achieved with the help of GROUP N BY. For example, in the following case, we get two movies for each year rather than just one, which a simple GROUP BY release_year would have returned.
SELECT release_year, title FROM films GROUP 2 BY release_year ORDER BY release_year DESC LIMIT 6;
+--------------+-----------------------------+
| release_year | title |
+--------------+-----------------------------+
| 2009 | ALICE FANTASIA |
| 2009 | ALIEN CENTER |
| 2008 | AMADEUS HOLY |
| 2008 | ANACONDA CONFESSIONS |
| 2007 | ANGELS LIFE |
| 2007 | ARACHNOPHOBIA ROLLERCOASTER |
+--------------+-----------------------------+
Another crucial analytics requirement is to sort elements within a group. To achieve this, use the WITHIN GROUP ORDER BY ... {ASC|DESC} clause. For example, let's get the highest-rated film for each year. Note that it works alongside the regular ORDER BY:
- WITHIN GROUP ORDER BY sorts results inside a group
- ORDER BY sorts the groups themselves
These two work entirely independently.
SELECT release_year, title, rental_rate FROM films GROUP BY release_year WITHIN GROUP ORDER BY rental_rate DESC ORDER BY release_year DESC LIMIT 5;
+--------------+------------------+-------------+
| release_year | title | rental_rate |
+--------------+------------------+-------------+
| 2009 | AMERICAN CIRCUS | 4.990000 |
| 2008 | ANTHEM LUKE | 4.990000 |
| 2007 | ATTACKS HATE | 4.990000 |
| 2006 | ALADDIN CALENDAR | 4.990000 |
| 2005 | AIRPLANE SIERRA | 4.990000 |
+--------------+------------------+-------------+
HAVING expression is a helpful clause for filtering groups. While WHERE is applied before grouping, HAVING works with the groups. For example, let's keep only those years when the average rental rate of the films for that year was higher than 3. We get only four years:
SELECT release_year, avg(rental_rate) avg FROM films GROUP BY release_year HAVING avg > 3;
+--------------+------------+
| release_year | avg |
+--------------+------------+
| 2002 | 3.08259249 |
| 2001 | 3.09989142 |
| 2000 | 3.17556739 |
| 2006 | 3.26184368 |
+--------------+------------+
There is a function GROUPBY() which returns the key of the current group. It's useful in many cases, especially when you GROUP BY an MVA or a JSON value.
It can also be used in HAVING, for example, to keep only years 2000 and 2002.
Note that GROUPBY() is not recommended for use when you GROUP BY multiple fields at once. It will still work, but since the group key in this case is a compound of field values, it may not appear the way you expect.
SELECT release_year, count(*) FROM films GROUP BY release_year HAVING GROUPBY() IN (2000, 2002);
+--------------+----------+
| release_year | count(*) |
+--------------+----------+
| 2002 | 108 |
| 2000 | 97 |
+--------------+----------+
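Returning to the note above about GROUP BY on multiple fields, here is a minimal sketch (query only, output omitted) of what using GROUPBY() with multi-field grouping looks like; the returned key is a compound of both field values rather than either field on its own:
SELECT groupby() gkey, count(*) FROM films GROUP BY category_id, release_year LIMIT 3;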
Manticore supports grouping by MVA. To demonstrate how it works, let's create a table "shoes" with MVA "sizes" and insert a few documents into it:
create table shoes(title text, sizes multi);
insert into shoes values(0,'nike',(40,41,42)),(0,'adidas',(41,43)),(0,'reebook',(42,43));
so we have:
SELECT * FROM shoes;
+---------------------+----------+---------+
| id | sizes | title |
+---------------------+----------+---------+
| 1657851069130080265 | 40,41,42 | nike |
| 1657851069130080266 | 41,43 | adidas |
| 1657851069130080267 | 42,43 | reebook |
+---------------------+----------+---------+
If we now GROUP BY "sizes", it will process all our multi-value attributes and return an aggregation for each, in this case just the count:
SELECT groupby() gb, count(*) FROM shoes GROUP BY sizes ORDER BY gb asc;
+------+----------+
| gb | count(*) |
+------+----------+
| 40 | 1 |
| 41 | 2 |
| 42 | 2 |
| 43 | 2 |
+------+----------+
POST /search -d '
{
"index" : "shoes",
"limit": 0,
"aggs" :
{
"sizes" :
{
"terms" :
{
"field":"sizes",
"size":100
}
}
}
}
'
{
"took": 0,
"timed_out": false,
"hits": {
"total": 3,
"hits": [
]
},
"aggregations": {
"sizes": {
"buckets": [
{
"key": 43,
"doc_count": 2
},
{
"key": 42,
"doc_count": 2
},
{
"key": 41,
"doc_count": 2
},
{
"key": 40,
"doc_count": 1
}
]
}
}
}
$index->setName('shoes');
$search = $index->search('');
$search->limit(0);
$search->facet('sizes','sizes',100);
$results = $search->get();
print_r($results->getFacets());
Array
(
[sizes] => Array
(
[buckets] => Array
(
[0] => Array
(
[key] => 43
[doc_count] => 2
)
[1] => Array
(
[key] => 42
[doc_count] => 2
)
[2] => Array
(
[key] => 41
[doc_count] => 2
)
[3] => Array
(
[key] => 40
[doc_count] => 1
)
)
)
)
res =searchApi.search({"index":"shoes","limit":0,"aggs":{"sizes":{"terms":{"field":"sizes","size":100}}}})
{'aggregations': {u'sizes': {u'buckets': [{u'doc_count': 2, u'key': 43},
{u'doc_count': 2, u'key': 42},
{u'doc_count': 2, u'key': 41},
{u'doc_count': 1, u'key': 40}]}},
'hits': {'hits': [], 'max_score': None, 'total': 3},
'profile': None,
'timed_out': False,
'took': 0}
res = await searchApi.search({"index":"shoes","limit":0,"aggs":{"sizes":{"terms":{"field":"sizes","size":100}}}});
{"took":0,"timed_out":false,"aggregations":{"sizes":{"buckets":[{"key":43,"doc_count":2},{"key":42,"doc_count":2},{"key":41,"doc_count":2},{"key":40,"doc_count":1}]}},"hits":{"total":3,"hits":[]}}
HashMap<String,Object> aggs = new HashMap<String,Object>(){{
put("release_year", new HashMap<String,Object>(){{
put("terms", new HashMap<String,Object>(){{
put("field","release_year");
put("size",100);
}});
}});
}};
searchRequest = new SearchRequest();
searchRequest.setIndex("films");
searchRequest.setLimit(0);
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
searchRequest.setAggs(aggs);
searchResponse = searchApi.search(searchRequest);
class SearchResponse {
took: 0
timedOut: false
aggregations: {release_year={buckets=[{key=43, doc_count=2}, {key=42, doc_count=2}, {key=41, doc_count=2}, {key=40, doc_count=1}]}}
hits: class SearchResponseHits {
maxScore: null
total: 3
hits: []
}
profile: null
}
var agg = new Aggregation("release_year", "release_year");
agg.Size = 100;
object query = new { match_all=null };
var searchRequest = new SearchRequest("films", query);
searchRequest.Limit = 0;
searchRequest.Aggs = new List<Aggregation> {agg};
var searchResponse = searchApi.Search(searchRequest);
class SearchResponse {
took: 0
timedOut: false
aggregations: {release_year={buckets=[{key=43, doc_count=2}, {key=42, doc_count=2}, {key=41, doc_count=2}, {key=40, doc_count=1}]}}
hits: class SearchResponseHits {
maxScore: null
total: 3
hits: []
}
profile: null
}
res = await searchApi.search({
index: 'test',
aggs: {
mva_agg: {
terms: { field: "mva_field", size: 2 }
}
}
});
{
"took":0,
"timed_out":false,
"aggregations":
{
"mva_agg":
{
"buckets":
[{
"key":1,
"doc_count":4
},
{
"key":2,
"doc_count":2
}]
}
},
"hits":
{
"total":4,
"hits":[]
}
}
query := map[string]interface{} {};
searchRequest.SetQuery(query);
aggTerms := manticoreclient.NewAggregationTerms()
aggTerms.SetField("mva_field")
aggTerms.SetSize(2)
aggregation := manticoreclient.NewAggregation()
aggregation.SetTerms(aggTerms)
searchRequest.SetAggregation(aggregation)
res, _, _ := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()
{
"took":0,
"timed_out":false,
"aggregations":
{
"mva_agg":
{
"buckets":
[{
"key":1,
"doc_count":4
},
{
"key":2,
"doc_count":2
}]
}
},
"hits":
{
"total":5,
"hits":[]
}
}
If you have a field of type JSON, you can GROUP BY any node from it. To demonstrate this, let's create a table "products" with a few documents, each having a color in the "meta" JSON field:
create table products(title text, meta json);
insert into products values(0,'nike','{"color":"red"}'),(0,'adidas','{"color":"red"}'),(0,'puma','{"color":"green"}');
This gives us:
SELECT * FROM products;
+---------------------+-------------------+--------+
| id | meta | title |
+---------------------+-------------------+--------+
| 1657851069130080268 | {"color":"red"} | nike |
| 1657851069130080269 | {"color":"red"} | adidas |
| 1657851069130080270 | {"color":"green"} | puma |
+---------------------+-------------------+--------+
To group the products by color, we can simply use GROUP BY meta.color, and to display the corresponding group key in the SELECT list, we can use GROUPBY():
SELECT groupby() color, count(*) from products GROUP BY meta.color;
+-------+----------+
| color | count(*) |
+-------+----------+
| red | 2 |
| green | 1 |
+-------+----------+
POST /search -d '
{
"index" : "products",
"limit": 0,
"aggs" :
{
"color" :
{
"terms" :
{
"field":"meta.color",
"size":100
}
}
}
}
'
{
"took": 0,
"timed_out": false,
"hits": {
"total": 3,
"hits": [
]
},
"aggregations": {
"color": {
"buckets": [
{
"key": "green",
"doc_count": 1
},
{
"key": "red",
"doc_count": 2
}
]
}
}
}
$index->setName('products');
$search = $index->search('');
$search->limit(0);
$search->facet('meta.color','color',100);
$results = $search->get();
print_r($results->getFacets());
Array
(
[color] => Array
(
[buckets] => Array
(
[0] => Array
(
[key] => green
[doc_count] => 1
)
[1] => Array
(
[key] => red
[doc_count] => 2
)
)
)
)
res =searchApi.search({"index":"products","limit":0,"aggs":{"color":{"terms":{"field":"meta.color","size":100}}}})
{'aggregations': {u'color': {u'buckets': [{u'doc_count': 1,
u'key': u'green'},
{u'doc_count': 2, u'key': u'red'}]}},
'hits': {'hits': [], 'max_score': None, 'total': 3},
'profile': None,
'timed_out': False,
'took': 0}
res = await searchApi.search({"index":"products","limit":0,"aggs":{"color":{"terms":{"field":"meta.color","size":100}}}});
{"took":0,"timed_out":false,"aggregations":{"color":{"buckets":[{"key":"green","doc_count":1},{"key":"red","doc_count":2}]}},"hits":{"total":3,"hits":[]}}
HashMap<String,Object> aggs = new HashMap<String,Object>(){{
put("color", new HashMap<String,Object>(){{
put("terms", new HashMap<String,Object>(){{
put("field","meta.color");
put("size",100);
}});
}});
}};
searchRequest = new SearchRequest();
searchRequest.setIndex("products");
searchRequest.setLimit(0);
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
searchRequest.setAggs(aggs);
searchResponse = searchApi.search(searchRequest);
class SearchResponse {
took: 0
timedOut: false
aggregations: {color={buckets=[{key=green, doc_count=1}, {key=red, doc_count=2}]}}
hits: class SearchResponseHits {
maxScore: null
total: 3
hits: []
}
profile: null
}
var agg = new Aggregation("color", "meta.color");
agg.Size = 100;
object query = new { match_all=null };
var searchRequest = new SearchRequest("products", query);
searchRequest.Limit = 0;
searchRequest.Aggs = new List<Aggregation> {agg};
var searchResponse = searchApi.Search(searchRequest);
class SearchResponse {
took: 0
timedOut: false
aggregations: {color={buckets=[{key=green, doc_count=1}, {key=red, doc_count=2}]}}
hits: class SearchResponseHits {
maxScore: null
total: 3
hits: []
}
profile: null
}
res = await searchApi.search({
index: 'test',
aggs: {
json_agg: {
terms: { field: "json_field.year", size: 1 }
}
}
});
{
"took":0,
"timed_out":false,
"aggregations":
{
"json_agg":
{
"buckets":
[{
"key":2000,
"doc_count":2
},
{
"key":2001,
"doc_count":2
}]
}
},
"hits":
{
"total":4,
"hits":[]
}
}
query := map[string]interface{} {};
searchRequest.SetQuery(query);
aggTerms := manticoreclient.NewAggregationTerms()
aggTerms.SetField("json_field.year")
aggTerms.SetSize(2)
aggregation := manticoreclient.NewAggregation()
aggregation.SetTerms(aggTerms)
searchRequest.SetAggregation(aggregation)
res, _, _ := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()
{
"took":0,
"timed_out":false,
"aggregations":
{
"json_agg":
{
"buckets":
[{
"key":2000,
"doc_count":2
},
{
"key":2001,
"doc_count":2
}]
}
},
"hits":
{
"total":4,
"hits":[]
}
}
Besides COUNT(*), which returns the number of elements in each group, you can use various other aggregation functions:
While COUNT(*) returns the number of all elements in the group, COUNT(DISTINCT field) returns the number of unique values of the field in the group, which may be completely different from the total count. For instance, you can have 100 elements in the group, but all with the same value for a certain field. COUNT(DISTINCT field) helps to determine that. To demonstrate this, let's create a table "students" with the student's name, age, and major:
CREATE TABLE students(name text, age int, major string);
INSERT INTO students values(0,'John',21,'arts'),(0,'William',22,'business'),(0,'Richard',21,'cs'),(0,'Rebecca',22,'cs'),(0,'Monica',21,'arts');
so we have:
MySQL [(none)]> SELECT * from students;
+---------------------+------+----------+---------+
| id | age | major | name |
+---------------------+------+----------+---------+
| 1657851069130080271 | 21 | arts | John |
| 1657851069130080272 | 22 | business | William |
| 1657851069130080273 | 21 | cs | Richard |
| 1657851069130080274 | 22 | cs | Rebecca |
| 1657851069130080275 | 21 | arts | Monica |
+---------------------+------+----------+---------+
In the example, you can see that if we GROUP BY major and display both COUNT(*) and COUNT(DISTINCT age), it becomes clear that there are two students who chose the major "cs" with two unique ages, but for the major "arts", there are also two students, yet only one unique age.
There can be at most one COUNT(DISTINCT) per query.
Note: by default, counts are approximate. In fact, some of them are exact, while others are approximate; more on that below.
Manticore supports two algorithms for computing counts of distinct values. One is a legacy algorithm that uses a lot of memory and is usually slow. It collects {group; value} pairs, sorts them, and periodically discards duplicates. The benefit of this approach is that it guarantees exact counts within a plain table. You can enable it by setting the distinct_precision_threshold option to 0.
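A minimal sketch of forcing the exact (legacy) algorithm for one query, assuming distinct_precision_threshold can be passed as a per-query OPTION, using the students table created above (output omitted):
SELECT major, count(distinct age) FROM students GROUP BY major OPTION distinct_precision_threshold=0;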
The other algorithm (enabled by default) loads counts into a hash table and returns its size. If the hash table becomes too large, its contents are moved into a HyperLogLog. This is where the counts become approximate since HyperLogLog is a probabilistic algorithm. The advantage is that the maximum memory usage per group is fixed and depends on the accuracy of the HyperLogLog. The overall memory usage also depends on the max_matches setting, which reflects the number of groups.
The distinct_precision_threshold option sets the threshold below which counts are guaranteed to be exact. The HyperLogLog accuracy setting and the threshold for the "hash table to HyperLogLog" conversion are derived from this setting. It's important to use this option with caution because doubling it will double the maximum memory required for count calculations. The maximum memory usage can be roughly estimated using this formula: 64 * max_matches * distinct_precision_threshold. Note that this is the worst-case scenario, and in most cases, count calculations will use significantly less RAM.
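For a rough sense of scale (hypothetical values, not defaults): with max_matches = 1000 and distinct_precision_threshold = 4000, the worst-case estimate is 64 * 1000 * 4000 = 256,000,000 bytes, or roughly 256 MB, while typical queries use far less.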
COUNT(DISTINCT) against a distributed table or a real-time table consisting of multiple disk chunks may return inaccurate results, but the result should be accurate for a distributed table consisting of local plain or real-time tables with the same schema (identical set/order of fields, but may have different tokenization settings).
SELECT major, count(*), count(distinct age) FROM students GROUP BY major;
+----------+----------+---------------------+
| major | count(*) | count(distinct age) |
+----------+----------+---------------------+
| arts | 2 | 1 |
| business | 1 | 1 |
| cs | 2 | 2 |
+----------+----------+---------------------+
Often, you want to better understand the contents of each group. You can use GROUP N BY for that, but it would return additional rows you might not want in the output. GROUP_CONCAT() enriches your grouping by concatenating values of a specific field in the group. Let's take the previous example and improve it by displaying all the ages in each group.
GROUP_CONCAT(field) returns the list as comma-separated values.
SELECT major, count(*), count(distinct age), group_concat(age) FROM students GROUP BY major;
+----------+----------+---------------------+-------------------+
| major | count(*) | count(distinct age) | group_concat(age) |
+----------+----------+---------------------+-------------------+
| arts | 2 | 1 | 21,21 |
| business | 1 | 1 | 22 |
| cs | 2 | 2 | 21,22 |
+----------+----------+---------------------+-------------------+
Of course, you can also obtain the sum, average, minimum, and maximum values within a group.
SELECT release_year year, sum(rental_rate) sum, min(rental_rate) min, max(rental_rate) max, avg(rental_rate) avg FROM films GROUP BY release_year ORDER BY year asc LIMIT 5;
+------+------------+----------+----------+------------+
| year | sum | min | max | avg |
+------+------------+----------+----------+------------+
| 2000 | 308.030029 | 0.990000 | 4.990000 | 3.17556739 |
| 2001 | 282.090118 | 0.990000 | 4.990000 | 3.09989142 |
| 2002 | 332.919983 | 0.990000 | 4.990000 | 3.08259249 |
| 2003 | 310.940063 | 0.990000 | 4.990000 | 2.93339682 |
| 2004 | 300.920044 | 0.990000 | 4.990000 | 2.78629661 |
+------+------------+----------+----------+------------+
Grouping is done in fixed memory, which depends on the max_matches setting. If max_matches allows for storage of all found groups, the results will be 100% accurate. However, if the value of max_matches is lower, the results will be less accurate.
When parallel processing is involved, it can become more complicated. When pseudo_sharding is enabled and/or when using an RT table with several disk chunks, each chunk or pseudo shard gets a result set that is no larger than max_matches. This can lead to inaccuracies in aggregates and group counts when the result sets from different threads are merged. To fix this, either a larger max_matches value or disabling parallel processing can be used.
Manticore will try to increase max_matches up to max_matches_increase_threshold if it detects that groupby may return inaccurate results. Detection is based on the number of unique values of the groupby attribute, which is retrieved from secondary indexes (if present).
To ensure accurate aggregates and/or group counts when using RT tables or pseudo_sharding, accurate_aggregation can be enabled. This will try to increase max_matches up to the threshold, and if the threshold is not high enough, Manticore will disable parallel processing for the query.
MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5;
+------+----------+
| year | count(*) |
+------+----------+
| 2004 | 108 |
| 2002 | 108 |
| 2001 | 91 |
| 2005 | 93 |
| 2000 | 97 |
+------+----------+
MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5 option max_matches=1;
+------+----------+
| year | count(*) |
+------+----------+
| 2004 | 76 |
+------+----------+
MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5 option max_matches=2;
+------+----------+
| year | count(*) |
+------+----------+
| 2004 | 76 |
| 2002 | 74 |
+------+----------+
MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5 option max_matches=3;
+------+----------+
| year | count(*) |
+------+----------+
| 2004 | 108 |
| 2002 | 108 |
| 2001 | 91 |
+------+----------+
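Following up on the accurate_aggregation option mentioned above, here is a minimal sketch of enabling it per query (assuming it is accepted as a SELECT option; output omitted):
SELECT release_year year, count(*) FROM films GROUP BY year LIMIT 5 OPTION max_matches=1, accurate_aggregation=1;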
Faceted search is as crucial to a modern search application as autocomplete, spell correction, and search keyword highlighting, especially in e-commerce.

Faceted search comes in handy when dealing with large quantities of data and various interconnected properties, such as size, color, manufacturer, or other factors. When querying vast amounts of data, search results frequently include numerous entries that don't match the user's expectations. Faceted search enables the end user to explicitly define the criteria they want their search results to satisfy.
In Manticore Search, there's an optimization that maintains the result set of the original query and reuses it for each facet calculation. Since the aggregations are applied to an already calculated subset of documents, they're fast, and the total execution time can often be only slightly longer than the initial query. Facets can be added to any query, and the facet can be any attribute or expression. A facet result includes the facet values and the facet counts. Facets can be accessed using the SQL SELECT statement by declaring them at the very end of the query.
The facet values can originate from an attribute, a JSON property within a JSON attribute, or an expression. Facet values can also be aliased, but the alias must be unique across all result sets (the main query's result set and the other facets' result sets). The facet value is derived from the aggregated attribute/expression, but it can also come from another attribute/expression.
FACET {expr_list} [BY {expr_list} ] [DISTINCT {field_name}] [ORDER BY {expr | FACET()} {ASC | DESC}] [LIMIT [offset,] count]
Multiple facet declarations must be separated by whitespace.
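As an illustrative sketch (not an authoritative result), here is a query against the facetdemo table used later in this section that exercises the BY, ORDER BY, and LIMIT clauses shown above; two facets are declared in one query:
SELECT * FROM facetdemo
FACET brand_name BY brand_id ORDER BY COUNT(*) DESC LIMIT 3
FACET price ORDER BY FACET() ASC LIMIT 0,5;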
Facets can be defined in the aggs node:
"aggs" :
{
"group name" :
{
"terms" :
{
"field":"attribute name",
"size": 1000
},
"sort": [ {"attribute name": { "order":"asc" }} ]
}
}
where:
- group name is an alias assigned to the aggregation
- field value must contain the name of the attribute or expression being faceted
- size specifies the maximum number of buckets to include in the result. When not specified, it inherits the main query's limit. More details can be found in the Size of facet result section.
- sort specifies an array of attributes and/or additional properties using the same syntax as the "sort" parameter in the main query.
The result set will contain an aggregations node with the returned facets, where key is the aggregated value and doc_count is the aggregation count.
"aggregations": {
"group name": {
"buckets": [
{
"key": 10,
"doc_count": 1019
},
{
"key": 9,
"doc_count": 954
},
{
"key": 8,
"doc_count": 1021
},
{
"key": 7,
"doc_count": 1011
},
{
"key": 6,
"doc_count": 997
}
]
}
}
SELECT *, price AS aprice FROM facetdemo LIMIT 10 FACET price LIMIT 10 FACET brand_id LIMIT 5;
+------+-------+----------+---------------------+------------+-------------+---------------------------------------+------------+--------+
| id | price | brand_id | title | brand_name | property | j | categories | aprice |
+------+-------+----------+---------------------+------------+-------------+---------------------------------------+------------+--------+
| 1 | 306 | 1 | Product Ten Three | Brand One | Six_Ten | {"prop1":66,"prop2":91,"prop3":"One"} | 10,11 | 306 |
| 2 | 400 | 10 | Product Three One | Brand Ten | Four_Three | {"prop1":69,"prop2":19,"prop3":"One"} | 13,14 | 400 |
...
| 9 | 560 | 6 | Product Two Five | Brand Six | Eight_Two | {"prop1":90,"prop2":84,"prop3":"One"} | 13,14 | 560 |
| 10 | 229 | 9 | Product Three Eight | Brand Nine | Seven_Three | {"prop1":84,"prop2":39,"prop3":"One"} | 12,13 | 229 |
+------+-------+----------+---------------------+------------+-------------+---------------------------------------+------------+--------+
10 rows in set (0.00 sec)
+-------+----------+
| price | count(*) |
+-------+----------+
| 306 | 7 |
| 400 | 13 |
...
| 229 | 9 |
| 595 | 10 |
+-------+----------+
10 rows in set (0.00 sec)
+----------+----------+
| brand_id | count(*) |
+----------+----------+
| 1 | 1013 |
| 10 | 998 |
| 5 | 1007 |
| 8 | 1033 |
| 7 | 965 |
+----------+----------+
5 rows in set (0.00 sec)
POST /search -d '
{
"index" : "facetdemo",
"query" : {"match_all" : {} },
"limit": 5,
"aggs" :
{
"group_property" :
{
"terms" :
{
"field":"price"
}
},
"group_brand_id" :
{
"terms" :
{
"field":"brand_id"
}
}
}
}
'
{
"took": 3,
"timed_out": false,
"hits": {
"total": 10000,
"hits": [
{
"_id": "1",
"_score": 1,
"_source": {
"price": 197,
"brand_id": 10,
"brand_name": "Brand Ten",
"categories": [
10
]
}
},
...
{
"_id": "5",
"_score": 1,
"_source": {
"price": 805,
"brand_id": 7,
"brand_name": "Brand Seven",
"categories": [
11,
12,
13
]
}
}
]
},
"aggregations": {
"group_property": {
"buckets": [
{
"key": 1000,
"doc_count": 11
},
{
"key": 999,
"doc_count": 12
},
...
{
"key": 991,
"doc_count": 7
}
]
},
"group_brand_id": {
"buckets": [
{
"key": 10,
"doc_count": 1019
},
{
"key": 9,
"doc_count": 954
},
{
"key": 8,
"doc_count": 1021
},
{
"key": 7,
"doc_count": 1011
},
{
"key": 6,
"doc_count": 997
}
]
}
}
}
$index->setName('facetdemo');
$search = $index->search('');
$search->limit(5);
$search->facet('price','price');
$search->facet('brand_id','group_brand_id');
$results = $search->get();
Array
(
[price] => Array
(
[buckets] => Array
(
[0] => Array
(
[key] => 1000
[doc_count] => 11
)
[1] => Array
(
[key] => 999
[doc_count] => 12
)
[2] => Array
(
[key] => 998
[doc_count] => 7
)
[3] => Array
(
[key] => 997
[doc_count] => 14
)
[4] => Array
(
[key] => 996
[doc_count] => 8
)
)
)
[group_brand_id] => Array
(
[buckets] => Array
(
[0] => Array
(
[key] => 10
[doc_count] => 1019
)
[1] => Array
(
[key] => 9
[doc_count] => 954
)
[2] => Array
(
[key] => 8
[doc_count] => 1021
)
[3] => Array
(
[key] => 7
[doc_count] => 1011
)
[4] => Array
(
[key] => 6
[doc_count] => 997
)
)
)
)
res =searchApi.search({"index":"facetdemo","query":{"match_all":{}},"limit":5,"aggs":{"group_property":{"terms":{"field":"price",}},"group_brand_id":{"terms":{"field":"brand_id"}}}})
{'aggregations': {u'group_brand_id': {u'buckets': [{u'doc_count': 1019,
u'key': 10},
{u'doc_count': 954,
u'key': 9},
{u'doc_count': 1021,
u'key': 8},
{u'doc_count': 1011,
u'key': 7},
{u'doc_count': 997,
u'key': 6}]},
u'group_property': {u'buckets': [{u'doc_count': 11,
u'key': 1000},
{u'doc_count': 12,
u'key': 999},
{u'doc_count': 7,
u'key': 998},
{u'doc_count': 14,
u'key': 997},
{u'doc_count': 8,
u'key': 996}]}},
'hits': {'hits': [{u'_id': u'1',
u'_score': 1,
u'_source': {u'brand_id': 10,
u'brand_name': u'Brand Ten',
u'categories': [10],
u'price': 197,
u'property': u'Six',
u'title': u'Product Eight One'}},
{u'_id': u'2',
u'_score': 1,
u'_source': {u'brand_id': 6,
u'brand_name': u'Brand Six',
u'categories': [12, 13, 14],
u'price': 671,
u'property': u'Four',
u'title': u'Product Nine Seven'}},
{u'_id': u'3',
u'_score': 1,
u'_source': {u'brand_id': 3,
u'brand_name': u'Brand Three',
u'categories': [13, 14, 15],
u'price': 92,
u'property': u'Six',
u'title': u'Product Five Four'}},
{u'_id': u'4',
u'_score': 1,
u'_source': {u'brand_id': 10,
u'brand_name': u'Brand Ten',
u'categories': [11],
u'price': 713,
u'property': u'Five',
u'title': u'Product Eight Nine'}},
{u'_id': u'5',
u'_score': 1,
u'_source': {u'brand_id': 7,
u'brand_name': u'Brand Seven',
u'categories': [11, 12, 13],
u'price': 805,
u'property': u'Two',
u'title': u'Product Ten Three'}}],
'max_score': None,
'total': 10000},
'profile': None,
'timed_out': False,
'took': 4}
res = await searchApi.search({"index":"facetdemo","query":{"match_all":{}},"limit":5,"aggs":{"group_property":{"terms":{"field":"price",}},"group_brand_id":{"terms":{"field":"brand_id"}}}});
{"took":0,"timed_out":false,"hits":{"total":10000,"hits":[{"_id":"1","_score":1,"_source":{"price":197,"brand_id":10,"brand_name":"Brand Ten","categories":[10],"title":"Product Eight One","property":"Six"}},{"_id":"2","_score":1,"_source":{"price":671,"brand_id":6,"brand_name":"Brand Six","categories":[12,13,14],"title":"Product Nine Seven","property":"Four"}},{"_id":"3","_score":1,"_source":{"price":92,"brand_id":3,"brand_name":"Brand Three","categories":[13,14,15],"title":"Product Five Four","property":"Six"}},{"_id":"4","_score":1,"_source":{"price":713,"brand_id":10,"brand_name":"Brand Ten","categories":[11],"title":"Product Eight Nine","property":"Five"}},{"_id":"5","_score":1,"_source":{"price":805,"brand_id":7,"brand_name":"Brand Seven","categories":[11,12,13],"title":"Product Ten Three","property":"Two"}}]}}
aggs = new HashMap<String,Object>(){{
put("group_property", new HashMap<String,Object>(){{
put("terms", new HashMap<String,Object>(){{
put("field","price");
}});
}});
put("group_brand_id", new HashMap<String,Object>(){{
put("terms", new HashMap<String,Object>(){{
put("field","brand_id");
}});
}});
}};
searchRequest = new SearchRequest();
searchRequest.setIndex("facetdemo");
searchRequest.setLimit(5);
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
searchRequest.setAggs(aggs);
searchResponse = searchApi.search(searchRequest);
class SearchResponse {
took: 0
timedOut: false
aggregations: {group_property={buckets=[{key=1000, doc_count=11}, {key=999, doc_count=12}, {key=998, doc_count=7}, {key=997, doc_count=14}, {key=996, doc_count=8}]}, group_brand_id={buckets=[{key=10, doc_count=1019}, {key=9, doc_count=954}, {key=8, doc_count=1021}, {key=7, doc_count=1011}, {key=6, doc_count=997}]}}
hits: class SearchResponseHits {
maxScore: null
total: 10000
hits: [{_id=1, _score=1, _source={price=197, brand_id=10, brand_name=Brand Ten, categories=[10], title=Product Eight One, property=Six}}, {_id=2, _score=1, _source={price=671, brand_id=6, brand_name=Brand Six, categories=[12, 13, 14], title=Product Nine Seven, property=Four}}, {_id=3, _score=1, _source={price=92, brand_id=3, brand_name=Brand Three, categories=[13, 14, 15], title=Product Five Four, property=Six}}, {_id=4, _score=1, _source={price=713, brand_id=10, brand_name=Brand Ten, categories=[11], title=Product Eight Nine, property=Five}}, {_id=5, _score=1, _source={price=805, brand_id=7, brand_name=Brand Seven, categories=[11, 12, 13], title=Product Ten Three, property=Two}}]
}
profile: null
}
var agg1 = new Aggregation("group_property", "price");
var agg2 = new Aggregation("group_brand_id", "brand_id");
object query = new { match_all=null };
var searchRequest = new SearchRequest("facetdemo", query);
searchRequest.Limit = 5;
searchRequest.Aggs = new List<Aggregation> {agg1, agg2};
var searchResponse = searchApi.Search(searchRequest);
class SearchResponse {
took: 0
timedOut: false
aggregations: {group_property={buckets=[{key=1000, doc_count=11}, {key=999, doc_count=12}, {key=998, doc_count=7}, {key=997, doc_count=14}, {key=996, doc_count=8}]}, group_brand_id={buckets=[{key=10, doc_count=1019}, {key=9, doc_count=954}, {key=8, doc_count=1021}, {key=7, doc_count=1011}, {key=6, doc_count=997}]}}
hits: class SearchResponseHits {
maxScore: null
total: 10000
hits: [{_id=1, _score=1, _source={price=197, brand_id=10, brand_name=Brand Ten, categories=[10], title=Product Eight One, property=Six}}, {_id=2, _score=1, _source={price=671, brand_id=6, brand_name=Brand Six, categories=[12, 13, 14], title=Product Nine Seven, property=Four}}, {_id=3, _score=1, _source={price=92, brand_id=3, brand_name=Brand Three, categories=[13, 14, 15], title=Product Five Four, property=Six}}, {_id=4, _score=1, _source={price=713, brand_id=10, brand_name=Brand Ten, categories=[11], title=Product Eight Nine, property=Five}}, {_id=5, _score=1, _source={price=805, brand_id=7, brand_name=Brand Seven, categories=[11, 12, 13], title=Product Ten Three, property=Two}}]
}
profile: null
}
res = await searchApi.search({
index: 'test',
query: { match_all:{} },
aggs: {
name_group: {
terms: { field : 'name' }
},
cat_group: {
terms: { field: 'cat' }
}
}
});
{
"took": 0,
"timed_out": false,
"hits": {
"total": 5,
"hits": [
{
"_id": "1",
"_score": 1,
"_source": {
"content": "Text 1",
"name": "Doc 1",
"cat": 1
}
},
...
{
"_id": "5",
"_score": 1,
"_source": {
"content": "Text 5",
"name": "Doc 5",
"cat": 4
}
}
]
},
"aggregations": {
"name_group": {
"buckets": [
{
"key": "Doc 1",
"doc_count": 1
},
...
{
"key": "Doc 5",
"doc_count": 1
}
]
},
"cat_group": {
"buckets": [
{
"key": 1,
"doc_count": 2
},
...
{
"key": 4,
"doc_count": 1
}
]
}
}
}
query := map[string]interface{} {}
searchRequest.SetQuery(query)
aggByName := manticoreclient.NewAggregation()
aggTerms := manticoreclient.NewAggregationTerms()
aggTerms.SetField("name")
aggByName.SetTerms(aggTerms)
aggByCat := manticoreclient.NewAggregation()
catTerms := manticoreclient.NewAggregationTerms()
catTerms.SetField("cat")
aggByCat.SetTerms(catTerms)
aggs := map[string]manticoreclient.Aggregation{ "name_group": *aggByName, "cat_group": *aggByCat }
searchRequest.SetAggs(aggs)
res, _, _ := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()
{
"took": 0,
"timed_out": false,
"hits": {
"total": 5,
"hits": [
{
"_id": "1",
"_score": 1,
"_source": {
"content": "Text 1",
"name": "Doc 1",
"cat": 1
}
},
...
{
"_id": "5",
"_score": 1,
"_source": {
"content": "Text 5",
"name": "Doc 5",
"cat": 4
}
}
]
},
"aggregations": {
"name_group": {
"buckets": [
{
"key": "Doc 1",
"doc_count": 1
},
...
{
"key": "Doc 5",
"doc_count": 1
}
]
},
"cat_group": {
"buckets": [
{
"key": 1,
"doc_count": 2
},
...
{
"key": 4,
"doc_count": 1
}
]
}
}
}
Data can be faceted by aggregating another attribute or expression. For example, if the documents contain both the brand id and name, we can show brand names in the facet while aggregating by brand id. This can be done using FACET {expr1} BY {expr2}:
SELECT * FROM facetdemo FACET brand_name by brand_id;
+------+-------+----------+---------------------+-------------+-------------+---------------------------------------+------------+
| id | price | brand_id | title | brand_name | property | j | categories |
+------+-------+----------+---------------------+-------------+-------------+---------------------------------------+------------+
| 1 | 306 | 1 | Product Ten Three | Brand One | Six_Ten | {"prop1":66,"prop2":91,"prop3":"One"} | 10,11 |
| 2 | 400 | 10 | Product Three One | Brand Ten | Four_Three | {"prop1":69,"prop2":19,"prop3":"One"} | 13,14 |
....
| 19 | 855 | 1 | Product Seven Two | Brand One | Eight_Seven | {"prop1":63,"prop2":78,"prop3":"One"} | 10,11,12 |
| 20 | 31 | 9 | Product Four One | Brand Nine | Ten_Four | {"prop1":79,"prop2":42,"prop3":"One"} | 12,13,14 |
+------+-------+----------+---------------------+-------------+-------------+---------------------------------------+------------+
20 rows in set (0.00 sec)
+-------------+----------+
| brand_name | count(*) |
+-------------+----------+
| Brand One | 1013 |
| Brand Ten | 998 |
| Brand Five | 1007 |
| Brand Nine | 944 |
| Brand Two | 990 |
| Brand Six | 1039 |
| Brand Three | 1016 |
| Brand Four | 994 |
| Brand Eight | 1033 |
| Brand Seven | 965 |
+-------------+----------+
10 rows in set (0.00 sec)
If you need to remove duplicates from the buckets returned by FACET, you can use DISTINCT field_name, where field_name is the field by which you want to perform deduplication. It can also be id (which is the default) if you make a FACET query against a distributed table and are not sure whether you have unique ids in the tables (the tables should be local and have the same schema).
If you have multiple FACET declarations in your query, field_name should be the same in all of them.
DISTINCT returns an additional column count(distinct ...) before the column count(*), allowing you to obtain both results without needing to make another query.
SELECT brand_name, property FROM facetdemo FACET brand_name distinct property;
+-------------+----------+
| brand_name | property |
+-------------+----------+
| Brand Nine | Four |
| Brand Ten | Four |
| Brand One | Five |
| Brand Seven | Nine |
| Brand Seven | Seven |
| Brand Three | Seven |
| Brand Nine | Five |
| Brand Three | Eight |
| Brand Two | Eight |
| Brand Six | Eight |
| Brand Ten | Four |
| Brand Ten | Two |
| Brand Four | Ten |
| Brand One | Nine |
| Brand Four | Eight |
| Brand Nine | Seven |
| Brand Four | Five |
| Brand Three | Four |
| Brand Four | Two |
| Brand Four | Eight |
+-------------+----------+
20 rows in set (0.00 sec)
+-------------+--------------------------+----------+
| brand_name | count(distinct property) | count(*) |
+-------------+--------------------------+----------+
| Brand Nine | 3 | 3 |
| Brand Ten | 2 | 3 |
| Brand One | 2 | 2 |
| Brand Seven | 2 | 2 |
| Brand Three | 3 | 3 |
| Brand Two | 1 | 1 |
| Brand Six | 1 | 1 |
| Brand Four | 4 | 5 |
+-------------+--------------------------+----------+
8 rows in set (0.00 sec)
Facets can aggregate over expressions. A classic example is the segmentation of prices by specific ranges:
SELECT * FROM facetdemo FACET INTERVAL(price,200,400,600,800) AS price_range ;
+------+-------+----------+---------------------+-------------+-------------+---------------------------------------+------------+-------------+
| id | price | brand_id | title | brand_name | property | j | categories | price_range |
+------+-------+----------+---------------------+-------------+-------------+---------------------------------------+------------+-------------+
| 1 | 306 | 1 | Product Ten Three | Brand One | Six_Ten | {"prop1":66,"prop2":91,"prop3":"One"} | 10,11 | 1 |
...
+------+-------+----------+---------------------+-------------+-------------+---------------------------------------+------------+-------------+
20 rows in set (0.00 sec)
+-------------+----------+
| price_range | count(*) |
+-------------+----------+
| 0 | 1885 |
| 3 | 1973 |
| 4 | 2100 |
| 2 | 1999 |
| 1 | 2043 |
+-------------+----------+
5 rows in set (0.01 sec)
POST /search -d '
{
"index": "facetdemo",
"query":
{
"match_all": {}
},
"expressions":
{
"price_range": "INTERVAL(price,200,400,600,800)"
},
"aggs":
{
"group_property":
{
"terms":
{
"field": "price_range"
}
}
}
}
{
"took": 3,
"timed_out": false,
"hits": {
"total": 10000,
"hits": [
{
"_id": "1",
"_score": 1,
"_source": {
"price": 197,
"brand_id": 10,
"brand_name": "Brand Ten",
"categories": [
10
],
"price_range": 0
}
},
...
{
"_id": "20",
"_score": 1,
"_source": {
"price": 227,
"brand_id": 3,
"brand_name": "Brand Three",
"categories": [
12,
13
],
"price_range": 1
}
}
]
},
"aggregations": {
"group_property": {
"buckets": [
{
"key": 4,
"doc_count": 2100
},
{
"key": 3,
"doc_count": 1973
},
{
"key": 2,
"doc_count": 1999
},
{
"key": 1,
"doc_count": 2043
},
{
"key": 0,
"doc_count": 1885
}
]
}
}
}
$index->setName('facetdemo');
$search = $index->search('');
$search->limit(5);
$search->expression('price_range','INTERVAL(price,200,400,600,800)');
$search->facet('price_range','group_property');
$results = $search->get();
print_r($results->getFacets());
Array
(
[group_property] => Array
(
[buckets] => Array
(
[0] => Array
(
[key] => 4
[doc_count] => 2100
)
[1] => Array
(
[key] => 3
[doc_count] => 1973
)
[2] => Array
(
[key] => 2
[doc_count] => 1999
)
[3] => Array
(
[key] => 1
[doc_count] => 2043
)
[4] => Array
(
[key] => 0
[doc_count] => 1885
)
)
)
)
res =searchApi.search({"index":"facetdemo","query":{"match_all":{}},"expressions":{"price_range":"INTERVAL(price,200,400,600,800)"},"aggs":{"group_property":{"terms":{"field":"price_range"}}}})
{'aggregations': {u'group_property': {u'buckets': [{u'doc_count': 2100,
                                                    u'key': 4},
                                                   {u'doc_count': 1973,
                                                    u'key': 3},
                                                   {u'doc_count': 1999,
                                                    u'key': 2},
                                                   {u'doc_count': 2043,
                                                    u'key': 1},
                                                   {u'doc_count': 1885,
                                                    u'key': 0}]}},
'hits': {'hits': [{u'_id': u'1',
u'_score': 1,
u'_source': {u'brand_id': 10,
u'brand_name': u'Brand Ten',
u'categories': [10],
u'price': 197,
u'property': u'Six',
u'title': u'Product Eight One'}},
{u'_id': u'2',
u'_score': 1,
u'_source': {u'brand_id': 6,
u'brand_name': u'Brand Six',
u'categories': [12, 13, 14],
u'price': 671,
u'property': u'Four',
u'title': u'Product Nine Seven'}},
{u'_id': u'3',
u'_score': 1,
u'_source': {u'brand_id': 3,
u'brand_name': u'Brand Three',
u'categories': [13, 14, 15],
u'price': 92,
u'property': u'Six',
u'title': u'Product Five Four'}},
{u'_id': u'4',
u'_score': 1,
u'_source': {u'brand_id': 10,
u'brand_name': u'Brand Ten',
u'categories': [11],
u'price': 713,
u'property': u'Five',
u'title': u'Product Eight Nine'}},
{u'_id': u'5',
u'_score': 1,
u'_source': {u'brand_id': 7,
u'brand_name': u'Brand Seven',
u'categories': [11, 12, 13],
u'price': 805,
u'property': u'Two',
u'title': u'Product Ten Three'}}],
'max_score': None,
'total': 10000},
'profile': None,
'timed_out': False,
'took': 0}
res = await searchApi.search({"index":"facetdemo","query":{"match_all":{}},"expressions":{"price_range":"INTERVAL(price,200,400,600,800)"},"aggs":{"group_property":{"terms":{"field":"price_range"}}}});
{"took":0,"timed_out":false,"hits":{"total":10000,"hits":[{"_id":"1","_score":1,"_source":{"price":197,"brand_id":10,"brand_name":"Brand Ten","categories":[10],"title":"Product Eight One","property":"Six","price_range":0}},{"_id":"2","_score":1,"_source":{"price":671,"brand_id":6,"brand_name":"Brand Six","categories":[12,13,14],"title":"Product Nine Seven","property":"Four","price_range":3}},{"_id":"3","_score":1,"_source":{"price":92,"brand_id":3,"brand_name":"Brand Three","categories":[13,14,15],"title":"Product Five Four","property":"Six","price_range":0}},{"_id":"4","_score":1,"_source":{"price":713,"brand_id":10,"brand_name":"Brand Ten","categories":[11],"title":"Product Eight Nine","property":"Five","price_range":3}},{"_id":"5","_score":1,"_source":{"price":805,"brand_id":7,"brand_name":"Brand Seven","categories":[11,12,13],"title":"Product Ten Three","property":"Two","price_range":4}},{"_id":"6","_score":1,"_source":{"price":420,"brand_id":2,"brand_name":"Brand Two","categories":[10,11],"title":"Product Two One","property":"Six","price_range":2}},{"_id":"7","_score":1,"_source":{"price":412,"brand_id":9,"brand_name":"Brand Nine","categories":[10],"title":"Product Four Nine","property":"Eight","price_range":2}},{"_id":"8","_score":1,"_source":{"price":300,"brand_id":9,"brand_name":"Brand Nine","categories":[13,14,15],"title":"Product Eight Four","property":"Five","price_range":1}},{"_id":"9","_score":1,"_source":{"price":728,"brand_id":1,"brand_name":"Brand One","categories":[11],"title":"Product Nine Six","property":"Four","price_range":3}},{"_id":"10","_score":1,"_source":{"price":622,"brand_id":3,"brand_name":"Brand Three","categories":[10,11],"title":"Product Six Seven","property":"Two","price_range":3}},{"_id":"11","_score":1,"_source":{"price":462,"brand_id":5,"brand_name":"Brand Five","categories":[10,11],"title":"Product Ten Two","property":"Eight","price_range":2}},{"_id":"12","_score":1,"_source":{"price":939,"brand_id":7,"brand_name":"Brand Seven","categories":[12,13],"title":"Product Nine Seven","property":"Six","price_range":4}},{"_id":"13","_score":1,"_source":{"price":948,"brand_id":8,"brand_name":"Brand Eight","categories":[12],"title":"Product Ten One","property":"Six","price_range":4}},{"_id":"14","_score":1,"_source":{"price":900,"brand_id":9,"brand_name":"Brand Nine","categories":[12,13,14],"title":"Product Ten Nine","property":"Three","price_range":4}},{"_id":"15","_score":1,"_source":{"price":224,"brand_id":3,"brand_name":"Brand Three","categories":[13],"title":"Product Two Six","property":"Four","price_range":1}},{"_id":"16","_score":1,"_source":{"price":713,"brand_id":10,"brand_name":"Brand Ten","categories":[12],"title":"Product Two Four","property":"Six","price_range":3}},{"_id":"17","_score":1,"_source":{"price":510,"brand_id":2,"brand_name":"Brand Two","categories":[10],"title":"Product Ten Two","property":"Seven","price_range":2}},{"_id":"18","_score":1,"_source":{"price":702,"brand_id":10,"brand_name":"Brand Ten","categories":[12,13],"title":"Product Nine One","property":"Three","price_range":3}},{"_id":"19","_score":1,"_source":{"price":836,"brand_id":4,"brand_name":"Brand Four","categories":[10,11,12],"title":"Product Four Five","property":"Two","price_range":4}},{"_id":"20","_score":1,"_source":{"price":227,"brand_id":3,"brand_name":"Brand Three","categories":[12,13],"title":"Product Three Four","property":"Ten","price_range":1}}]}}
searchRequest = new SearchRequest();
expressions = new HashMap<String,Object>(){{
put("price_range","INTERVAL(price,200,400,600,800)");
}};
searchRequest.setExpressions(expressions);
aggs = new HashMap<String,Object>(){{
put("group_property", new HashMap<String,Object>(){{
put("terms", new HashMap<String,Object>(){{
put("field","price_range");
}});
}});
}};
searchRequest.setIndex("facetdemo");
searchRequest.setLimit(5);
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
searchRequest.setAggs(aggs);
searchResponse = searchApi.search(searchRequest);
class SearchResponse {
took: 0
timedOut: false
aggregations: {group_property={buckets=[{key=4, doc_count=2100}, {key=3, doc_count=1973}, {key=2, doc_count=1999}, {key=1, doc_count=2043}, {key=0, doc_count=1885}]}}
hits: class SearchResponseHits {
maxScore: null
total: 10000
hits: [{_id=1, _score=1, _source={price=197, brand_id=10, brand_name=Brand Ten, categories=[10], title=Product Eight One, property=Six, price_range=0}}, {_id=2, _score=1, _source={price=671, brand_id=6, brand_name=Brand Six, categories=[12, 13, 14], title=Product Nine Seven, property=Four, price_range=3}}, {_id=3, _score=1, _source={price=92, brand_id=3, brand_name=Brand Three, categories=[13, 14, 15], title=Product Five Four, property=Six, price_range=0}}, {_id=4, _score=1, _source={price=713, brand_id=10, brand_name=Brand Ten, categories=[11], title=Product Eight Nine, property=Five, price_range=3}}, {_id=5, _score=1, _source={price=805, brand_id=7, brand_name=Brand Seven, categories=[11, 12, 13], title=Product Ten Three, property=Two, price_range=4}}]
}
profile: null
}
var expr = new Dictionary<string, string> { {"price_range", "INTERVAL(price,200,400,600,800)"} };
var agg = new Aggregation("group_property", "price_range");
object query = new { match_all=null };
var searchRequest = new SearchRequest("facetdemo", query);
searchRequest.Limit = 5;
searchRequest.Expressions = new List<Object> {expr};
searchRequest.Aggs = new List<Aggregation> {agg};
var searchResponse = searchApi.Search(searchRequest);
class SearchResponse {
took: 0
timedOut: false
aggregations: {group_property={buckets=[{key=4, doc_count=2100}, {key=3, doc_count=1973}, {key=2, doc_count=1999}, {key=1, doc_count=2043}, {key=0, doc_count=1885}]}}
hits: class SearchResponseHits {
maxScore: null
total: 10000
hits: [{_id=1, _score=1, _source={price=197, brand_id=10, brand_name=Brand Ten, categories=[10], title=Product Eight One, property=Six, price_range=0}}, {_id=2, _score=1, _source={price=671, brand_id=6, brand_name=Brand Six, categories=[12, 13, 14], title=Product Nine Seven, property=Four, price_range=3}}, {_id=3, _score=1, _source={price=92, brand_id=3, brand_name=Brand Three, categories=[13, 14, 15], title=Product Five Four, property=Six, price_range=0}}, {_id=4, _score=1, _source={price=713, brand_id=10, brand_name=Brand Ten, categories=[11], title=Product Eight Nine, property=Five, price_range=3}}, {_id=5, _score=1, _source={price=805, brand_id=7, brand_name=Brand Seven, categories=[11, 12, 13], title=Product Ten Three, property=Two, price_range=4}}]
}
profile: null
}
res = await searchApi.search({
index: 'test',
query: { match_all:{} },
expressions: { cat_range: "INTERVAL(cat,1,3)" },
aggs: {
expr_group: {
terms: { field : 'cat_range' }
}
}
});
{
"took": 0,
"timed_out": false,
"hits": {
"total": 5,
"hits": [
{
"_id": "1",
"_score": 1,
"_source": {
"content": "Text 1",
"name": "Doc 1",
"cat": 1,
"cat_range": 1
}
},
...
{
"_id": "5",
"_score": 1,
"_source": {
"content": "Text 5",
"name": "Doc 5",
"cat": 4,
"cat_range": 2,
}
}
]
},
"aggregations": {
"expr_group": {
"buckets": [
{
"key": 0,
"doc_count": 0
},
{
"key": 1,
"doc_count": 3
},
{
"key": 2,
"doc_count": 2
}
]
}
}
}
query := map[string]interface{} {}
searchRequest.SetQuery(query)
exprs := map[string]string{ "cat_range": "INTERVAL(cat,1,3)" }
searchRequest.SetExpressions(exprs)
aggByExpr := manticoreclient.NewAggregation()
aggTerms := manticoreclient.NewAggregationTerms()
aggTerms.SetField("cat_range")
aggByExpr.SetTerms(aggTerms)
aggs := map[string]manticoreclient.Aggregation{ "expr_group": *aggByExpr }
searchRequest.SetAggs(aggs)
res, _, _ := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()
{
"took": 0,
"timed_out": false,
"hits": {
"total": 5,
"hits": [
{
"_id": "1",
"_score": 1,
"_source": {
"content": "Text 1",
"name": "Doc 1",
"cat": 1,
"cat_range": 1
}
},
...
{
"_id": "5",
"_score": 1,
"_source": {
"content": "Text 5",
"name": "Doc 5",
"cat": 4,
"cat_range": 2
}
}
]
},
"aggregations": {
"expr_group": {
"buckets": [
{
"key": 0,
"doc_count": 0
},
{
"key": 1,
"doc_count": 3
},
{
"key": 2,
"doc_count": 2
}
]
}
}
}
Facets can aggregate over multi-level grouping, with the result set being the same as if the query performed a multi-level grouping:
SELECT *,INTERVAL(price,200,400,600,800) AS price_range FROM facetdemo
FACET price_range AS fprice_range, brand_name ORDER BY brand_name asc;
+------+-------+----------+---------------------+-------------+-------------+---------------------------------------+------------+-------------+
| id | price | brand_id | title | brand_name | property | j | categories | price_range |
+------+-------+----------+---------------------+-------------+-------------+---------------------------------------+------------+-------------+
| 1 | 306 | 1 | Product Ten Three | Brand One | Six_Ten | {"prop1":66,"prop2":91,"prop3":"One"} | 10,11 | 1 |
...
+------+-------+----------+---------------------+-------------+-------------+---------------------------------------+------------+-------------+
20 rows in set (0.00 sec)
+--------------+-------------+----------+
| fprice_range | brand_name | count(*) |
+--------------+-------------+----------+
| 1 | Brand Eight | 197 |
| 4 | Brand Eight | 235 |
| 3 | Brand Eight | 203 |
| 2 | Brand Eight | 201 |
| 0 | Brand Eight | 197 |
| 4 | Brand Five | 230 |
| 2 | Brand Five | 197 |
| 1 | Brand Five | 204 |
| 3 | Brand Five | 193 |
| 0 | Brand Five | 183 |
| 1 | Brand Four | 195 |
...
Facets can aggregate over histogram values by constructing fixed-size buckets over the values.
The key function is:
key_of_the_bucket = offset + interval * floor ( ( value - offset ) / interval )
The histogram argument interval must be positive, and the histogram argument offset must be positive and less than interval. By default, the buckets are returned as an array. The histogram argument keyed makes the response a dictionary with the bucket keys.
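For instance, with interval = 100 and offset = 0 (illustrative values), a price of 342 falls into the bucket 0 + 100 * floor((342 - 0) / 100) = 300, which matches the bucket keys in the example below.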
SELECT COUNT(*), HISTOGRAM(price, {hist_interval=100}) as price_range FROM facets GROUP BY price_range ORDER BY price_range ASC;
+----------+-------------+
| count(*) | price_range |
+----------+-------------+
| 5 | 0 |
| 5 | 100 |
| 1 | 300 |
| 4 | 400 |
| 1 | 500 |
| 3 | 700 |
| 1 | 900 |
+----------+-------------+
POST /search -d '
{
"size": 0,
"index": "facets",
"aggs": {
"price_range": {
"histogram": {
"field": "price",
"interval": 300
}
}
}
}'
{
"took": 0,
"timed_out": false,
"hits": {
"total": 20,
"total_relation": "eq",
"hits": []
},
"aggregations": {
"price_range": {
"buckets": [
{
"key": 0,
"doc_count": 10
},
{
"key": 300,
"doc_count": 6
},
{
"key": 600,
"doc_count": 3
},
{
"key": 900,
"doc_count": 1
}
]
}
}
}
POST /search -d '
{
"size": 0,
"index": "facets",
"aggs": {
"price_range": {
"histogram": {
"field": "price",
"interval": 300,
"keyed": true
}
}
}
}'
{
"took": 0,
"timed_out": false,
"hits": {
"total": 20,
"total_relation": "eq",
"hits": []
},
"aggregations": {
"price_range": {
"buckets": {
"0": {
"key": 0,
"doc_count": 10
},
"300": {
"key": 300,
"doc_count": 6
},
"600": {
"key": 600,
"doc_count": 3
},
"900": {
"key": 900,
"doc_count": 1
}
}
}
}
}
Facets can aggregate over histogram date values, which is similar to the normal histogram. The difference is that the interval is specified using a date or time expression. Such expressions require special support because the intervals are not always of fixed length. Values are rounded to the closest bucket using the following key function:
key_of_the_bucket = interval * floor ( value / interval )
The histogram parameter calendar_interval understands months to have different amounts of days. The accepted intervals are described in the date_histogram expression. By default, the buckets are returned as an array. The histogram argument keyed makes the response a dictionary with the bucket keys.
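For instance, with calendar_interval = 'month', a timestamp of 1490000000 (March 20, 2017 UTC) is rounded down to the start of its month, 1488326400 (2017-03-01T00:00:00), which corresponds to the second bucket in the example below.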
SELECT count(*), DATE_HISTOGRAM(tm, {calendar_interval='month'}) AS months FROM idx_dates GROUP BY months ORDER BY months ASC
+----------+------------+
| count(*) | months |
+----------+------------+
| 442 | 1485907200 |
| 744 | 1488326400 |
| 720 | 1491004800 |
| 230 | 1493596800 |
+----------+------------+
POST /search -d '
{
"index": "idx_dates",
"size": 0,
"aggs": {
"months": {
"date_histogram": {
"field": "tm",
"keyed": true,
"calendar_interval": "month"
}
}
}
}'
{
"timed_out": false,
"hits": {
"total": 2136,
"total_relation": "eq",
"hits": []
},
"aggregations": {
"months": {
"buckets": {
"2017-02-01T00:00:00": {
"key": 1485907200,
"key_as_string": "2017-02-01T00:00:00",
"doc_count": 442
},
"2017-03-01T00:00:00": {
"key": 1488326400,
"key_as_string": "2017-03-01T00:00:00",
"doc_count": 744
},
"2017-04-01T00:00:00": {
"key": 1491004800,
"key_as_string": "2017-04-01T00:00:00",
"doc_count": 720
},
"2017-05-01T00:00:00": {
"key": 1493596800,
"key_as_string": "2017-05-01T00:00:00",
"doc_count": 230
}
}
}
}
}
Facets can aggregate over a set of ranges. The values are checked against the bucket range, where each bucket includes the from value and excludes the to value from the range.
Setting the keyed property to true makes the response a dictionary with the bucket keys rather than an array.
SELECT COUNT(*), RANGE(price, {range_to=150},{range_from=150,range_to=300},{range_from=300}) price_range FROM facets GROUP BY price_range ORDER BY price_range ASC;
+----------+-------------+
| count(*) | price_range |
+----------+-------------+
| 8 | 0 |
| 2 | 1 |
| 10 | 2 |
+----------+-------------+
POST /search -d '
{
"size": 0,
"index": "facets",
"aggs": {
"price_range": {
"range": {
"field": "price",
"ranges": [
{
"to": 99
},
{
"from": 99,
"to": 550
},
{
"from": 550
}
]
}
}
}
}'
{
"took": 0,
"timed_out": false,
"hits": {
"total": 20,
"total_relation": "eq",
"hits": []
},
"aggregations": {
"price_range": {
"buckets": [
{
"key": "*-99",
"to": "99",
"doc_count": 5
},
{
"key": "99-550",
"from": "99",
"to": "550",
"doc_count": 11
},
{
"key": "550-*",
"from": "550",
"doc_count": 4
}
]
}
}
}
POST /search -d '
{
"size":0,
"index":"facets",
"aggs":{
"price_range":{
"range":{
"field":"price",
"keyed":true,
"ranges":[
{
"from":100,
"to":399
},
{
"from":399
}
]
}
}
}
}'
{
"took": 0,
"timed_out": false,
"hits": {
"total": 20,
"total_relation": "eq",
"hits": []
},
"aggregations": {
"price_range": {
"buckets": {
"100-399": {
"from": "100",
"to": "399",
"doc_count": 6
},
"399-*": {
"from": "399",
"doc_count": 9
}
}
}
}
}
Facets can aggregate over a set of date ranges, which is similar to the normal range. The difference is that the from and to values can be expressed in Date math expressions. This aggregation includes the from value and excludes the to value for each range. Setting the keyed property to true makes the response a dictionary with the bucket keys rather than an array.
SELECT COUNT(*), DATE_RANGE(tm, {range_to='2017||+2M/M'},{range_from='2017||+2M/M',range_to='2017||+5M/M'},{range_from='2017||+5M/M'}) AS points FROM idx_dates GROUP BY points ORDER BY points ASC;
+----------+--------+
| count(*) | points |
+----------+--------+
| 442 | 0 |
| 1464 | 1 |
| 230 | 2 |
+----------+--------+
POST /search -d '
{
"index": "idx_dates",
"size": 0,
"aggs": {
"points": {
"date_range": {
"field": "tm",
"keyed": true,
"ranges": [
{
"to": "2017||+2M/M"
},
{
"from": "2017||+2M/M",
"to": "2017||+4M/M"
},
{
"from": "2017||+4M/M",
"to": "2017||+5M/M"
},
{
"from": "2017||+5M/M"
}
]
}
}
}
}'
{
"timed_out": false,
"hits": {
"total": 2136,
"total_relation": "eq",
"hits": []
},
"aggregations": {
"points": {
"buckets": {
"*-2017-03-01T00:00:00": {
"to": "2017-03-01T00:00:00",
"doc_count": 442
},
"2017-03-01T00:00:00-2017-04-01T00:00:00": {
"from": "2017-03-01T00:00:00",
"to": "2017-04-01T00:00:00",
"doc_count": 744
},
"2017-04-01T00:00:00-2017-05-01T00:00:00": {
"from": "2017-04-01T00:00:00",
"to": "2017-05-01T00:00:00",
"doc_count": 720
},
"2017-05-01T00:00:00-*": {
"from": "2017-05-01T00:00:00",
"doc_count": 230
}
}
}
}
}
Facets support the ORDER BY clause just like a standard query. Each facet can have its own ordering, and the facet ordering doesn't affect the main result set's ordering, which is determined by the main query's ORDER BY. Sorting can be done on attribute name, count (using COUNT(*)), or the special FACET() function, which provides the aggregated data values.
SELECT * FROM facetdemo
FACET brand_name BY brand_id ORDER BY FACET() ASC
FACET brand_name BY brand_id ORDER BY brand_name ASC
FACET brand_name BY brand_id ORDER BY COUNT(*) DESC;
+------+-------+----------+---------------------+-------------+-------------+---------------------------------------+------------+
| id | price | brand_id | title | brand_name | property | j | categories |
+------+-------+----------+---------------------+-------------+-------------+---------------------------------------+------------+
| 1 | 306 | 1 | Product Ten Three | Brand One | Six_Ten | {"prop1":66,"prop2":91,"prop3":"One"} | 10,11 |
...
| 20 | 31 | 9 | Product Four One | Brand Nine | Ten_Four | {"prop1":79,"prop2":42,"prop3":"One"} | 12,13,14 |
+------+-------+----------+---------------------+-------------+-------------+---------------------------------------+------------+
20 rows in set (0.01 sec)
+-------------+----------+
| brand_name | count(*) |
+-------------+----------+
| Brand One | 1013 |
| Brand Two | 990 |
| Brand Three | 1016 |
| Brand Four | 994 |
| Brand Five | 1007 |
| Brand Six | 1039 |
| Brand Seven | 965 |
| Brand Eight | 1033 |
| Brand Nine | 944 |
| Brand Ten | 998 |
+-------------+----------+
10 rows in set (0.01 sec)
+-------------+----------+
| brand_name | count(*) |
+-------------+----------+
| Brand Eight | 1033 |
| Brand Five | 1007 |
| Brand Four | 994 |
| Brand Nine | 944 |
| Brand One | 1013 |
| Brand Seven | 965 |
| Brand Six | 1039 |
| Brand Ten | 998 |
| Brand Three | 1016 |
| Brand Two | 990 |
+-------------+----------+
10 rows in set (0.01 sec)
+-------------+----------+
| brand_name | count(*) |
+-------------+----------+
| Brand Six | 1039 |
| Brand Eight | 1033 |
| Brand Three | 1016 |
| Brand One | 1013 |
| Brand Five | 1007 |
| Brand Ten | 998 |
| Brand Four | 994 |
| Brand Two | 990 |
| Brand Seven | 965 |
| Brand Nine | 944 |
+-------------+----------+
10 rows in set (0.01 sec)
POST /search -d '
{
"index":"table_name",
"aggs":{
"group_property":{
"terms":{
"field":"a"
},
"sort":[
{
"count(*)":{
"order":"desc"
}
}
]
}
}
}'
{
"took": 0,
"timed_out": false,
"hits": {
"total": 6,
"total_relation": "eq",
"hits": [
{
"_id": "1515697460415037554",
"_score": 1,
"_source": {
"a": 1
}
},
{
"_id": "1515697460415037555",
"_score": 1,
"_source": {
"a": 2
}
},
{
"_id": "1515697460415037556",
"_score": 1,
"_source": {
"a": 2
}
},
{
"_id": "1515697460415037557",
"_score": 1,
"_source": {
"a": 3
}
},
{
"_id": "1515697460415037558",
"_score": 1,
"_source": {
"a": 3
}
},
{
"_id": "1515697460415037559",
"_score": 1,
"_source": {
"a": 3
}
}
]
},
"aggregations": {
"group_property": {
"buckets": [
{
"key": 3,
"doc_count": 3
},
{
"key": 2,
"doc_count": 2
},
{
"key": 1,
"doc_count": 1
}
]
}
}
}
By default, each facet result set is limited to 20 values. The number of facet values can be controlled individually for each facet with the LIMIT clause, either as the number of values to return (LIMIT count) or with an offset (LIMIT offset, count).
The maximum number of facet values that can be returned is limited by the query's max_matches setting. If you want to implement dynamic max_matches (limiting max_matches to offset + per-page for better performance), keep in mind that a max_matches value that is too low can reduce the number of facet values. In that case, use a max_matches value large enough to cover the expected number of facet values.
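For example, if a facet is expected to return up to a few hundred distinct values, the query can raise max_matches explicitly; a minimal sketch (the value 1000 is illustrative):
SELECT * FROM facetdemo LIMIT 0,20 OPTION max_matches=1000 FACET brand_name LIMIT 1000;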
SELECT * FROM facetdemo
FACET brand_name BY brand_id ORDER BY FACET() ASC LIMIT 0,1
FACET brand_name BY brand_id ORDER BY brand_name ASC LIMIT 2,4
FACET brand_name BY brand_id ORDER BY COUNT(*) DESC LIMIT 4;
+------+-------+----------+---------------------+-------------+-------------+---------------------------------------+------------+
| id | price | brand_id | title | brand_name | property | j | categories |
+------+-------+----------+---------------------+-------------+-------------+---------------------------------------+------------+
| 1 | 306 | 1 | Product Ten Three | Brand One | Six_Ten | {"prop1":66,"prop2":91,"prop3":"One"} | 10,11 |
...
| 20 | 31 | 9 | Product Four One | Brand Nine | Ten_Four | {"prop1":79,"prop2":42,"prop3":"One"} | 12,13,14 |
+------+-------+----------+---------------------+-------------+-------------+---------------------------------------+------------+
20 rows in set (0.01 sec)
+-------------+----------+
| brand_name | count(*) |
+-------------+----------+
| Brand One | 1013 |
+-------------+----------+
1 rows in set (0.01 sec)
+-------------+----------+
| brand_name | count(*) |
+-------------+----------+
| Brand Four | 994 |
| Brand Nine | 944 |
| Brand One | 1013 |
| Brand Seven | 965 |
+-------------+----------+
4 rows in set (0.01 sec)
+-------------+----------+
| brand_name | count(*) |
+-------------+----------+
| Brand Six | 1039 |
| Brand Eight | 1033 |
| Brand Three | 1016 |
+-------------+----------+
3 rows in set (0.01 sec)
POST /search -d '
{
"index" : "facetdemo",
"query" : {"match_all" : {} },
"limit": 5,
"aggs" :
{
"group_property" :
{
"terms" :
{
"field":"price",
"size":1,
}
},
"group_brand_id" :
{
"terms" :
{
"field":"brand_id",
"size":3
}
}
}
}
'
{
"took": 3,
"timed_out": false,
"hits": {
"total": 10000,
"hits": [
{
"_id": "1",
"_score": 1,
"_source": {
"price": 197,
"brand_id": 10,
"brand_name": "Brand Ten",
"categories": [
10
]
}
},
...
{
"_id": "5",
"_score": 1,
"_source": {
"price": 805,
"brand_id": 7,
"brand_name": "Brand Seven",
"categories": [
11,
12,
13
]
}
}
]
},
"aggregations": {
"group_property": {
"buckets": [
{
"key": 1000,
"doc_count": 11
}
]
},
"group_brand_id": {
"buckets": [
{
"key": 10,
"doc_count": 1019
},
{
"key": 9,
"doc_count": 954
},
{
"key": 8,
"doc_count": 1021
}
]
}
}
}
$index->setName('facetdemo');
$search = $index->search('');
$search->limit(5);
$search->facet('price','price',1);
$search->facet('brand_id','group_brand_id',3);
$results = $search->get();
print_r($results->getFacets());
Array
(
[price] => Array
(
[buckets] => Array
(
[0] => Array
(
[key] => 1000
[doc_count] => 11
)
)
)
[group_brand_id] => Array
(
[buckets] => Array
(
[0] => Array
(
[key] => 10
[doc_count] => 1019
)
[1] => Array
(
[key] => 9
[doc_count] => 954
)
[2] => Array
(
[key] => 8
[doc_count] => 1021
)
)
)
)
res =searchApi.search({"index":"facetdemo","query":{"match_all":{}},"limit":5,"aggs":{"group_property":{"terms":{"field":"price","size":1,}},"group_brand_id":{"terms":{"field":"brand_id","size":3}}}})
{'aggregations': {u'group_brand_id': {u'buckets': [{u'doc_count': 1019,
u'key': 10},
{u'doc_count': 954,
u'key': 9},
{u'doc_count': 1021,
u'key': 8}]},
u'group_property': {u'buckets': [{u'doc_count': 11,
u'key': 1000}]}},
'hits': {'hits': [{u'_id': u'1',
u'_score': 1,
u'_source': {u'brand_id': 10,
u'brand_name': u'Brand Ten',
u'categories': [10],
u'price': 197,
u'property': u'Six',
u'title': u'Product Eight One'}},
{u'_id': u'2',
u'_score': 1,
u'_source': {u'brand_id': 6,
u'brand_name': u'Brand Six',
u'categories': [12, 13, 14],
u'price': 671,
u'property': u'Four',
u'title': u'Product Nine Seven'}},
{u'_id': u'3',
u'_score': 1,
u'_source': {u'brand_id': 3,
u'brand_name': u'Brand Three',
u'categories': [13, 14, 15],
u'price': 92,
u'property': u'Six',
u'title': u'Product Five Four'}},
{u'_id': u'4',
u'_score': 1,
u'_source': {u'brand_id': 10,
u'brand_name': u'Brand Ten',
u'categories': [11],
u'price': 713,
u'property': u'Five',
u'title': u'Product Eight Nine'}},
{u'_id': u'5',
u'_score': 1,
u'_source': {u'brand_id': 7,
u'brand_name': u'Brand Seven',
u'categories': [11, 12, 13],
u'price': 805,
u'property': u'Two',
u'title': u'Product Ten Three'}}],
'max_score': None,
'total': 10000},
'profile': None,
'timed_out': False,
'took': 0}
res = await searchApi.search({"index":"facetdemo","query":{"match_all":{}},"limit":5,"aggs":{"group_property":{"terms":{"field":"price","size":1,}},"group_brand_id":{"terms":{"field":"brand_id","size":3}}}});
{"took":0,"timed_out":false,"hits":{"total":10000,"hits":[{"_id":"1","_score":1,"_source":{"price":197,"brand_id":10,"brand_name":"Brand Ten","categories":[10],"title":"Product Eight One","property":"Six"}},{"_id":"2","_score":1,"_source":{"price":671,"brand_id":6,"brand_name":"Brand Six","categories":[12,13,14],"title":"Product Nine Seven","property":"Four"}},{"_id":"3","_score":1,"_source":{"price":92,"brand_id":3,"brand_name":"Brand Three","categories":[13,14,15],"title":"Product Five Four","property":"Six"}},{"_id":"4","_score":1,"_source":{"price":713,"brand_id":10,"brand_name":"Brand Ten","categories":[11],"title":"Product Eight Nine","property":"Five"}},{"_id":"5","_score":1,"_source":{"price":805,"brand_id":7,"brand_name":"Brand Seven","categories":[11,12,13],"title":"Product Ten Three","property":"Two"}}]}}
searchRequest = new SearchRequest();
aggs = new HashMap<String,Object>(){{
put("group_property", new HashMap<String,Object>(){{
put("terms", new HashMap<String,Object>(){{
put("field","price");
put("size",1);
}});
}});
put("group_brand_id", new HashMap<String,Object>(){{
put("terms", new HashMap<String,Object>(){{
put("field","brand_id");
put("size",3);
}});
}});
}};
searchRequest.setIndex("facetdemo");
searchRequest.setLimit(5);
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
searchRequest.setAggs(aggs);
searchResponse = searchApi.search(searchRequest);
class SearchResponse {
took: 0
timedOut: false
aggregations: {group_property={buckets=[{key=1000, doc_count=11}]}, group_brand_id={buckets=[{key=10, doc_count=1019}, {key=9, doc_count=954}, {key=8, doc_count=1021}]}}
hits: class SearchResponseHits {
maxScore: null
total: 10000
hits: [{_id=1, _score=1, _source={price=197, brand_id=10, brand_name=Brand Ten, categories=[10], title=Product Eight One, property=Six}}, {_id=2, _score=1, _source={price=671, brand_id=6, brand_name=Brand Six, categories=[12, 13, 14], title=Product Nine Seven, property=Four}}, {_id=3, _score=1, _source={price=92, brand_id=3, brand_name=Brand Three, categories=[13, 14, 15], title=Product Five Four, property=Six}}, {_id=4, _score=1, _source={price=713, brand_id=10, brand_name=Brand Ten, categories=[11], title=Product Eight Nine, property=Five}}, {_id=5, _score=1, _source={price=805, brand_id=7, brand_name=Brand Seven, categories=[11, 12, 13], title=Product Ten Three, property=Two}}]
}
profile: null
}
var agg1 = new Aggregation("group_property", "price");
agg1.Size = 1;
var agg2 = new Aggregation("group_brand_id", "brand_id");
agg2.Size = 3;
object query = new { match_all = (object)null };
var searchRequest = new SearchRequest("facetdemo", query);
searchRequest.Aggs = new List<Aggregation> {agg1, agg2};
var searchResponse = searchApi.Search(searchRequest);
class SearchResponse {
took: 0
timedOut: false
aggregations: {group_property={buckets=[{key=1000, doc_count=11}]}, group_brand_id={buckets=[{key=10, doc_count=1019}, {key=9, doc_count=954}, {key=8, doc_count=1021}]}}
hits: class SearchResponseHits {
maxScore: null
total: 10000
hits: [{_id=1, _score=1, _source={price=197, brand_id=10, brand_name=Brand Ten, categories=[10], title=Product Eight One, property=Six}}, {_id=2, _score=1, _source={price=671, brand_id=6, brand_name=Brand Six, categories=[12, 13, 14], title=Product Nine Seven, property=Four}}, {_id=3, _score=1, _source={price=92, brand_id=3, brand_name=Brand Three, categories=[13, 14, 15], title=Product Five Four, property=Six}}, {_id=4, _score=1, _source={price=713, brand_id=10, brand_name=Brand Ten, categories=[11], title=Product Eight Nine, property=Five}}, {_id=5, _score=1, _source={price=805, brand_id=7, brand_name=Brand Seven, categories=[11, 12, 13], title=Product Ten Three, property=Two}}]
}
profile: null
}
res = await searchApi.search({
index: 'test',
query: { match_all:{} },
aggs: {
name_group: {
terms: { field : 'name', size: 1 }
},
cat_group: {
terms: { field: 'cat' }
}
}
});
{
"took": 0,
"timed_out": false,
"hits": {
"total": 5,
"hits": [
{
"_id": "1",
"_score": 1,
"_source": {
"content": "Text 1",
"name": "Doc 1",
"cat": 1
}
},
...
{
"_id": "5",
"_score": 1,
"_source": {
"content": "Text 5",
"name": "Doc 5",
"cat": 4
}
}
]
},
"aggregations": {
"name_group": {
"buckets": [
{
"key": "Doc 1",
"doc_count": 1
}
]
},
"cat_group": {
"buckets": [
{
"key": 1,
"doc_count": 2
},
...
{
"key": 4,
"doc_count": 1
}
]
}
}
}
query := map[string]interface{} {}
searchRequest.SetQuery(query)
aggByName := manticoreclient.NewAggregation()
aggTerms := manticoreclient.NewAggregationTerms()
aggTerms.SetField("name")
aggByName.SetTerms(aggTerms)
aggByName.SetSize(1)
aggByCat := manticoreclient.NewAggregation()
aggTerms.SetField("cat")
aggByCat.SetTerms(aggTerms)
aggs := map[string]Aggregation{} { "name_group": aggByName, "cat_group": aggByCat }
searchRequest.SetAggs(aggs)
res, _, _ := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()
{
"took": 0,
"timed_out": false,
"hits": {
"total": 5,
"hits": [
{
"_id": "1",
"_score": 1,
"_source": {
"content": "Text 1",
"name": "Doc 1",
"cat": 1
}
},
...
{
"_id": "5",
"_score": 1,
"_source": {
"content": "Text 5",
"name": "Doc 5",
"cat": 4
}
}
]
},
"aggregations": {
"name_group": {
"buckets": [
{
"key": "Doc 1",
"doc_count": 1
}
]
},
"cat_group": {
"buckets": [
{
"key": 1,
"doc_count": 2
},
...
{
"key": 4,
"doc_count": 1
}
]
}
}
}
When using SQL, a search with facets returns multiple result sets. The MySQL client/library/connector used must support multiple result sets in order to access the facet result sets.
Internally, FACET is a shorthand for executing a multi-query, where the first query contains the main search query and each of the remaining queries in the batch adds one clustering. As with multi-queries, the common query optimization can kick in for a faceted search, meaning the search query is executed only once and the facets operate on its result, with each facet adding only a fraction of time to the total query time.
To check if the faceted search ran in an optimized mode, you can look in the query log, where all logged queries will contain an xN string, where N is the number of queries that ran in the optimized group. Alternatively, you can check the output of the SHOW META statement, which will display a multiplier metric:
SELECT * FROM facetdemo FACET brand_id FACET price FACET categories;
SHOW META LIKE 'multiplier';
+------+-------+----------+---------------------+-------------+-------------+---------------------------------------+------------+
| id | price | brand_id | title | brand_name | property | j | categories |
+------+-------+----------+---------------------+-------------+-------------+---------------------------------------+------------+
| 1 | 306 | 1 | Product Ten Three | Brand One | Six_Ten | {"prop1":66,"prop2":91,"prop3":"One"} | 10,11 |
...
+----------+----------+
| brand_id | count(*) |
+----------+----------+
| 1 | 1013 |
...
+-------+----------+
| price | count(*) |
+-------+----------+
| 306 | 7 |
...
+------------+----------+
| categories | count(*) |
+------------+----------+
| 10 | 2436 |
...
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| multiplier | 4 |
+---------------+-------+
1 row in set (0.00 sec)
One of the greatest features of Manticore Search is the ability to combine full-text searching with geo-location. For example, a retailer can offer a search where a user looks for a product, and the result set can indicate the closest shop that has the product in stock, so the user can go in-store and pick it up. A travel site can provide results based on a search limited to a certain area and have the results sorted by the distance from a point (for example, 'search museums near a hotel').
To perform geo-searching, a document needs to contain pairs of latitude/longitude coordinates. The coordinates can be stored as float attributes. If the document has multiple locations, it may be convenient to use a JSON attribute to store coordinate pairs.
table myrt
{
...
rt_attr_float = lat
rt_attr_float = lon
...
}
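Tables like this can also be created online with SQL, and, as mentioned above, a JSON attribute can hold several locations per document; a minimal sketch (table and attribute names are illustrative):
CREATE TABLE myrt(title text, lat float, lon float);
CREATE TABLE places(title text, locations json);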
The coordinates can be stored as degrees or radians.
To find out the distance between two points, the GEODIST() function can be used. GEODIST requires two pairs of coordinates as its first four parameters.
The 5th parameter, in a simplified JSON format, can configure certain aspects of the function. By default, GEODIST expects coordinates in radians, but in=degrees can be added to allow degrees as input. The coordinates passed to GEODIST must be of the same type (degrees or radians) as the ones stored in the table; otherwise, the results will be misleading.
The calculated distance is by default in meters, but with the out option, it can be transformed to kilometers, feet, or miles. Lastly, by default, a calculation method called adaptive is used. An alternative method based on the haversine algorithm is available; however, this one is slower and less precise.
The result of the function, the distance, can be used in the ORDER BY clause to sort the results:
SELECT *, GEODIST(40.7643929, -73.9997683, lat, lon, {in=degrees, out=miles}) AS distance FROM myindex WHERE MATCH('...') ORDER BY distance ASC, WEIGHT() DESC;
Or to limit the results to a radial area around the point:
SELECT *, GEODIST(40.7643929, -73.9997683, lat, lon, {in=degrees, out=miles}) AS distance FROM myindex WHERE MATCH('...') AND distance < 1000 ORDER BY WEIGHT(), distance ASC;
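The same option syntax can request other units and the alternative calculation method mentioned above; a sketch, assuming out=km and method=haversine as the option names:
SELECT *, GEODIST(40.7643929, -73.9997683, lat, lon, {in=degrees, out=km, method=haversine}) AS distance FROM myindex WHERE MATCH('...') ORDER BY distance ASC;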
Another geo search feature is the ability to determine if a location is within a specified area. A special function constructs a polygon object, which is then used by another function to test whether a set of coordinates is contained within that polygon or not.
There are two functions available for creating the polygon:
POLY2D is suitable for geo searches when the area has sides shorter than 500km (for polygons with 3-4 sides; for polygons with more sides, lower values should be considered). For areas with longer sides, using GEOPOLY2D is required to maintain accurate results. GEOPOLY2D expects coordinates as latitude/longitude pairs in degrees; using radians will yield results in flat space (similar to POLY2D).
CONTAINS() takes a polygon and a set of coordinates as input and outputs 1 if the point is inside the polygon or 0 otherwise.
SELECT *,CONTAINS(GEOPOLY2D(40.76439, -73.9997, 42.21211, -73.999, 42.21211, -76.123, 40.76439, -76.123), 41.5445, -74.973) AS inside FROM myindex WHERE MATCH('...') AND inside=1;
Percolate queries are also known as Persistent queries, Prospective search, document routing, search in reverse, and inverse search.
The traditional way of conducting searches involves storing documents and performing search queries against them. However, there are cases where we want to apply a query to a newly incoming document to signal a match. Some scenarios where this is desired include monitoring systems that collect data and notify users about specific events, such as reaching a certain threshold for a metric or a particular value appearing in the monitored data. Another example is news aggregation, where users may want to be notified only about certain categories or topics, or even specific "keywords."
In these situations, traditional search is not the best fit, as it assumes the desired search is performed over the entire collection. This process gets multiplied by the number of users, resulting in many queries running over the entire collection, which can cause significant additional load. The alternative approach described in this section involves storing the queries instead and testing them against an incoming new document or a batch of documents.
Google Alerts, AlertHN, Bloomberg Terminal, and other systems that allow users to subscribe to specific content utilize similar technology.
- See percolate for information on creating a PQ table.
- See Adding rules to a percolate table to learn how to add percolate rules (also known as PQ rules). Here's a quick example:
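For instance, with a PQ table named products that has a full-text field title and a string attribute color (the same table is built step by step below), a rule can be added with a regular INSERT:
INSERT INTO products(query, filters) VALUES ('@title shoes', 'color=\'red\'');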
The key thing to remember about percolate queries is that your search queries are already in the table. What you need to provide are documents to check if any of them match any of the stored rules.
You can perform a percolate query via SQL or JSON interfaces, as well as using programming language clients. The SQL approach offers more flexibility, while the HTTP method is simpler and provides most of what you need. The table below can help you understand the differences.
| Desired Behavior | SQL | HTTP | PHP |
|---|---|---|---|
| Provide a single document | CALL PQ('tbl', '{doc1}') | query.percolate.document{doc1} | $client->pq()->search([$percolate]) |
| Provide a single document (alternative) | CALL PQ('tbl', 'doc1', 0 as docs_json) | - | - |
| Provide multiple documents | CALL PQ('tbl', ('doc1', 'doc2'), 0 as docs_json) | query.percolate.documents[{doc1}, {doc2}] | $client->pq()->search([$percolate]) |
| Provide multiple documents (alternative) | CALL PQ('tbl', ('{doc1}', '{doc2}')) | - | - |
| Provide multiple documents (alternative) | CALL PQ('tbl', '[{doc1}, {doc2}]') | - | - |
| Return matching document ids | 0/1 as docs (disabled by default) | Enabled by default | Enabled by default |
| Use document's own id to show in the result | 'id field' as docs_id (disabled by default) | Not available | Not available |
| Treat input documents as JSON | 1 as docs_json (1 by default) | Enabled by default | Enabled by default |
| Treat input documents as plain text | 0 as docs_json (1 by default) | Not available | Not available |
| Sparsed distribution mode | default | default | default |
| Sharded distribution mode | sharded as mode | Not available | Not available |
| Return all info about matching query | 1 as query (0 by default) | Enabled by default | Enabled by default |
| Skip invalid JSON | 1 as skip_bad_json (0 by default) | Not available | Not available |
| Extended info in SHOW META | 1 as verbose (0 by default) | Not available | Not available |
| Define the number to be added to document ids if no docs_id field is provided (mostly relevant in distributed PQ modes) | 1 as shift (0 by default) | Not available | Not available |
To demonstrate how this works, here are a few examples. Let's create a PQ table with two fields:
- title (full-text)
- color (string)
and three rules in it:
- @title bag
- @title shoes. Filters: color='red'
- @title shoes. Filters: color IN ('blue', 'green')
CREATE TABLE products(title text, color string) type='pq';
INSERT INTO products(query) values('@title bag');
INSERT INTO products(query,filters) values('@title shoes', 'color=\'red\'');
INSERT INTO products(query,filters) values('@title shoes', 'color in (\'blue\', \'green\')');
select * from products;
+---------------------+--------------+------+---------------------------+
| id | query | tags | filters |
+---------------------+--------------+------+---------------------------+
| 1657852401006149635 | @title shoes | | color IN ('blue', 'green') |
| 1657852401006149636 | @title shoes | | color='red' |
| 1657852401006149637 | @title bag | | |
+---------------------+--------------+------+---------------------------+
PUT /pq/products/doc/
{
"query": {
"match": {
"title": "bag"
}
},
"filters": ""
}
PUT /pq/products/doc/
{
"query": {
"match": {
"title": "shoes"
}
},
"filters": "color='red'"
}
PUT /pq/products/doc/
{
"query": {
"match": {
"title": "shoes"
}
},
"filters": "color IN ('blue', 'green')"
}
{
"index": "products",
"type": "doc",
"_id": "1657852401006149661",
"result": "created"
}
{
"index": "products",
"type": "doc",
"_id": "1657852401006149662",
"result": "created"
}
{
"index": "products",
"type": "doc",
"_id": "1657852401006149663",
"result": "created"
}
$index = [
'index' => 'products',
'body' => [
'columns' => [
'title' => ['type' => 'text'],
'color' => ['type' => 'string']
],
'settings' => [
'type' => 'pq'
]
]
];
$client->indices()->create($index);
$query = [
'index' => 'products',
'body' => [ 'query'=>['match'=>['title'=>'bag']]]
];
$client->pq()->doc($query);
$query = [
'index' => 'products',
'body' => [ 'query'=>['match'=>['title'=>'shoes']],'filters'=>"color='red'"]
];
$client->pq()->doc($query);
$query = [
'index' => 'products',
'body' => [ 'query'=>['match'=>['title'=>'shoes']],'filters'=>"color IN ('blue', 'green')"]
];
$client->pq()->doc($query);
Array(
[index] => products
[type] => doc
[_id] => 1657852401006149661
[result] => created
)
Array(
[index] => products
[type] => doc
[_id] => 1657852401006149662
[result] => created
)
Array(
[index] => products
[type] => doc
[_id] => 1657852401006149663
[result] => created
)
utilsApi.sql('create table products(title text, color string) type=\'pq\'')
indexApi.insert({"index" : "products", "doc" : {"query" : "@title bag" }})
indexApi.insert({"index" : "products", "doc" : {"query" : "@title shoes", "filters": "color='red'" }})
indexApi.insert({"index" : "products", "doc" : {"query" : "@title shoes","filters": "color IN ('blue', 'green')" }})
{'created': True,
'found': None,
'id': 0,
'index': 'products',
'result': 'created'}
{'created': True,
'found': None,
'id': 0,
'index': 'products',
'result': 'created'}
{'created': True,
'found': None,
'id': 0,
'index': 'products',
'result': 'created'}
res = await utilsApi.sql('create table products(title text, color string) type=\'pq\'');
res = indexApi.insert({"index" : "products", "doc" : {"query" : "@title bag" }});
res = indexApi.insert({"index" : "products", "doc" : {"query" : "@title shoes", "filters": "color='red'" }});
res = indexApi.insert({"index" : "products", "doc" : {"query" : "@title shoes","filters": "color IN ('blue', 'green')" }});
"_index":"products","_id":0,"created":true,"result":"created"}
{"_index":"products","_id":0,"created":true,"result":"created"}
{"_index":"products","_id":0,"created":true,"result":"created"}
utilsApi.sql("create table products(title text, color string) type='pq'");
doc = new HashMap<String,Object>(){{
put("query", "@title bag");
}};
newdoc = new InsertDocumentRequest();
newdoc.index("products").setDoc(doc);
indexApi.insert(newdoc);
doc = new HashMap<String,Object>(){{
put("query", "@title shoes");
put("filters", "color='red'");
}};
newdoc = new InsertDocumentRequest();
newdoc.index("products").setDoc(doc);
indexApi.insert(newdoc);
doc = new HashMap<String,Object>(){{
put("query", "@title shoes");
put("filters", "color IN ('blue', 'green')");
}};
newdoc = new InsertDocumentRequest();
newdoc.index("products").setDoc(doc);
indexApi.insert(newdoc);
{total=0, error=, warning=}
class SuccessResponse {
index: products
id: 0
created: true
result: created
found: null
}
class SuccessResponse {
index: products
id: 0
created: true
result: created
found: null
}
class SuccessResponse {
index: products
id: 0
created: true
result: created
found: null
}
utilsApi.Sql("create table products(title text, color string) type='pq'");
Dictionary<string, Object> doc = new Dictionary<string, Object>();
doc.Add("query", "@title bag");
InsertDocumentRequest newdoc = new InsertDocumentRequest(index: "products", doc: doc);
indexApi.Insert(newdoc);
doc = new Dictionary<string, Object>();
doc.Add("query", "@title shoes");
doc.Add("filters", "color='red'");
newdoc = new InsertDocumentRequest(index: "products", doc: doc);
indexApi.Insert(newdoc);
doc = new Dictionary<string, Object>();
doc.Add("query", "@title bag");
doc.Add("filters", "color IN ('blue', 'green')");
newdoc = new InsertDocumentRequest(index: "products", doc: doc);
indexApi.Insert(newdoc);
{total=0, error="", warning=""}
class SuccessResponse {
index: products
id: 0
created: true
result: created
found: null
}
class SuccessResponse {
index: products
id: 0
created: true
result: created
found: null
}
class SuccessResponse {
index: products
id: 0
created: true
result: created
found: null
}
res = await utilsApi.sql("create table test_pq(title text, color string) type='pq'");
res = indexApi.insert({
index: 'test_pq',
doc: { query : '@title bag' }
});
res = indexApi.insert({
index: 'test_pq',
doc: { query: '@title shoes', filters: "color='red'" }
});
res = indexApi.insert({
index: 'test_pq',
doc: { query : '@title shoes', filters: "color IN ('blue', 'green')" }
});
{
"_index":"test_pq",
"_id":1657852401006149661,
"created":true,
"result":"created"
}
{
"_index":"test_pq",
"_id":1657852401006149662,
"created":true,
"result":"created"
}
{
"_index":"test_pq",
"_id":1657852401006149663,
"created":true,
"result":"created"
}
apiClient.UtilsAPI.Sql(context.Background()).Body("create table test_pq(title text, color string) type='pq'").Execute()
indexDoc := map[string]interface{} {"query": "@title bag"}
indexReq := manticoreclient.NewInsertDocumentRequest("test_pq", indexDoc)
apiClient.IndexAPI.Insert(context.Background()).InsertDocumentRequest(*indexReq).Execute();
indexDoc = map[string]interface{} {"query": "@title shoes", "filters": "color='red'"}
indexReq = manticoreclient.NewInsertDocumentRequest("test_pq", indexDoc)
apiClient.IndexAPI.Insert(context.Background()).InsertDocumentRequest(*indexReq).Execute();
indexDoc = map[string]interface{} {"query": "@title shoes", "filters": "color IN ('blue', 'green')"}
indexReq = manticoreclient.NewInsertDocumentRequest("test_pq", indexDoc)
apiClient.IndexAPI.Insert(context.Background()).InsertDocumentRequest(*indexReq).Execute();
{
"_index":"test_pq",
"_id":1657852401006149661,
"created":true,
"result":"created"
}
{
"_index":"test_pq",
"_id":1657852401006149662,
"created":true,
"result":"created"
}
{
"_index":"test_pq",
"_id":1657852401006149663,
"created":true,
"result":"created"
}
The first document doesn't match any rules. It could match the first two, but they require additional filters.
The second document matches one rule. Note that CALL PQ expects documents to be JSON by default, but with 0 as docs_json you can pass plain strings instead.
SQL:
CALL PQ('products', 'Beautiful shoes', 0 as docs_json);
CALL PQ('products', 'What a nice bag', 0 as docs_json);
CALL PQ('products', '{"title": "What a nice bag"}');
+---------------------+
| id |
+---------------------+
| 1657852401006149637 |
+---------------------+
+---------------------+
| id |
+---------------------+
| 1657852401006149637 |
+---------------------+
POST /pq/products/_search
{
"query": {
"percolate": {
"document": {
"title": "What a nice bag"
}
}
}
}
{
"took": 0,
"timed_out": false,
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "products",
"_type": "doc",
"_id": "1657852401006149644",
"_score": "1",
"_source": {
"query": {
"ql": "@title bag"
}
},
"fields": {
"_percolator_document_slot": [
1
]
}
}
]
}
}
$percolate = [
'index' => 'products',
'body' => [
'query' => [
'percolate' => [
'document' => [
'title' => 'What a nice bag'
]
]
]
]
];
$client->pq()->search($percolate);
Array
(
[took] => 0
[timed_out] =>
[hits] => Array
(
[total] => 1
[max_score] => 1
[hits] => Array
(
[0] => Array
(
[_index] => products
[_type] => doc
[_id] => 1657852401006149644
[_score] => 1
[_source] => Array
(
[query] => Array
(
[match] => Array
(
[title] => bag
)
)
)
[fields] => Array
(
[_percolator_document_slot] => Array
(
[0] => 1
)
)
)
)
)
)
searchApi.percolate('products',{"query":{"percolate":{"document":{"title":"What a nice bag"}}}})
{'hits': {'hits': [{u'_id': u'2811025403043381480',
u'_index': u'products',
u'_score': u'1',
u'_source': {u'query': {u'ql': u'@title bag'}},
u'_type': u'doc',
u'fields': {u'_percolator_document_slot': [1]}}],
'total': 1},
'profile': None,
'timed_out': False,
'took': 0}
res = await searchApi.percolate('products',{"query":{"percolate":{"document":{"title":"What a nice bag"}}}});
{
"took": 0,
"timed_out": false,
"hits": {
"total": 1,
"hits": [
{
"_index": "products",
"_type": "doc",
"_id": "2811045522851233808",
"_score": "1",
"_source": {
"query": {
"ql": "@title bag"
}
},
"fields": {
"_percolator_document_slot": [
1
]
}
}
]
}
}
PercolateRequest percolateRequest = new PercolateRequest();
query = new HashMap<String,Object>(){{
put("percolate",new HashMap<String,Object >(){{
put("document", new HashMap<String,Object >(){{
put("title","what a nice bag");
}});
}});
}};
percolateRequest.query(query);
searchApi.percolate("test_pq",percolateRequest);
class SearchResponse {
took: 0
timedOut: false
hits: class SearchResponseHits {
total: 1
maxScore: 1
hits: [{_index=products, _type=doc, _id=2811045522851234109, _score=1, _source={query={ql=@title bag}}, fields={_percolator_document_slot=[1]}}]
aggregations: null
}
profile: null
}
Dictionary<string, Object> percolateDoc = new Dictionary<string, Object>();
percolateDoc.Add("document", new Dictionary<string, Object> {{ "title", "what a nice bag" }});
Dictionary<string, Object> query = new Dictionary<string, Object> {{ "percolate", percolateDoc }};
PercolateRequest percolateRequest = new PercolateRequest(query: query);
searchApi.Percolate("test_pq",percolateRequest);
class SearchResponse {
took: 0
timedOut: false
hits: class SearchResponseHits {
total: 1
maxScore: 1
hits: [{_index=products, _type=doc, _id=2811045522851234109, _score=1, _source={query={ql=@title bag}}, fields={_percolator_document_slot=[1]}}]
aggregations: null
}
profile: null
}
res = await searchApi.percolate('test_pq', { query: { percolate: { document : { title : 'What a nice bag' } } } } );
{
"took": 0,
"timed_out": false,
"hits": {
"total": 1,
"hits": [
{
"_index": "test_pq",
"_type": "doc",
"_id": "1657852401006149661",
"_score": "1",
"_source": {
"query": {
"ql": "@title bag"
}
},
"fields": {
"_percolator_document_slot": [
1
]
}
}
]
}
}
query := map[string]interface{} {"title": "what a nice bag"}
percolateRequestQuery := manticoreclient.NewPercolateQuery(query)
percolateRequest := manticoreclient.NewPercolateRequest(percolateRequestQuery)
res, _, _ := apiClient.SearchAPI.Percolate(context.Background(), "test_pq").PercolateRequest(*percolateRequest).Execute()
{
"took": 0,
"timed_out": false,
"hits": {
"total": 1,
"hits": [
{
"_index": "test_pq",
"_type": "doc",
"_id": "1657852401006149661",
"_score": "1",
"_source": {
"query": {
"ql": "@title bag"
}
},
"fields": {
"_percolator_document_slot": [
1
]
}
}
]
}
}
SQL:
CALL PQ('products', '{"title": "What a nice bag"}', 1 as query);
+---------------------+------------+------+---------+
| id | query | tags | filters |
+---------------------+------------+------+---------+
| 1657852401006149637 | @title bag | | |
+---------------------+------------+------+---------+
POST /pq/products/_search
{
"query": {
"percolate": {
"document": {
"title": "What a nice bag"
}
}
}
}
{
"took": 0,
"timed_out": false,
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "products",
"_type": "doc",
"_id": "1657852401006149644",
"_score": "1",
"_source": {
"query": {
"ql": "@title bag"
}
},
"fields": {
"_percolator_document_slot": [
1
]
}
}
]
}
}
$percolate = [
'index' => 'products',
'body' => [
'query' => [
'percolate' => [
'document' => [
'title' => 'What a nice bag'
]
]
]
]
];
$client->pq()->search($percolate);
Array
(
[took] => 0
[timed_out] =>
[hits] => Array
(
[total] => 1
[max_score] => 1
[hits] => Array
(
[0] => Array
(
[_index] => products
[_type] => doc
[_id] => 1657852401006149644
[_score] => 1
[_source] => Array
(
[query] => Array
(
[match] => Array
(
[title] => bag
)
)
)
[fields] => Array
(
[_percolator_document_slot] => Array
(
[0] => 1
)
)
)
)
)
)
searchApi.percolate('products',{"query":{"percolate":{"document":{"title":"What a nice bag"}}}})
{'hits': {'hits': [{u'_id': u'2811025403043381480',
u'_index': u'products',
u'_score': u'1',
u'_source': {u'query': {u'ql': u'@title bag'}},
u'_type': u'doc',
u'fields': {u'_percolator_document_slot': [1]}}],
'total': 1},
'profile': None,
'timed_out': False,
'took': 0}
res = await searchApi.percolate('products',{"query":{"percolate":{"document":{"title":"What a nice bag"}}}});
{
"took": 0,
"timed_out": false,
"hits": {
"total": 1,
"hits": [
{
"_index": "products",
"_type": "doc",
"_id": "2811045522851233808",
"_score": "1",
"_source": {
"query": {
"ql": "@title bag"
}
},
"fields": {
"_percolator_document_slot": [
1
]
}
}
]
}
}
PercolateRequest percolateRequest = new PercolateRequest();
query = new HashMap<String,Object>(){{
put("percolate",new HashMap<String,Object >(){{
put("document", new HashMap<String,Object >(){{
put("title","what a nice bag");
}});
}});
}};
percolateRequest.query(query);
searchApi.percolate("test_pq",percolateRequest);
class SearchResponse {
took: 0
timedOut: false
hits: class SearchResponseHits {
total: 1
maxScore: 1
hits: [{_index=products, _type=doc, _id=2811045522851234109, _score=1, _source={query={ql=@title bag}}, fields={_percolator_document_slot=[1]}}]
aggregations: null
}
profile: null
}
Dictionary<string, Object> percolateDoc = new Dictionary<string, Object>();
percolateDoc.Add("document", new Dictionary<string, Object> {{ "title", "what a nice bag" }});
Dictionary<string, Object> query = new Dictionary<string, Object> {{ "percolate", percolateDoc }};
PercolateRequest percolateRequest = new PercolateRequest(query: query);
searchApi.Percolate("test_pq",percolateRequest);
class SearchResponse {
took: 0
timedOut: false
hits: class SearchResponseHits {
total: 1
maxScore: 1
hits: [{_index=products, _type=doc, _id=2811045522851234109, _score=1, _source={query={ql=@title bag}}, fields={_percolator_document_slot=[1]}}]
aggregations: null
}
profile: null
}
res = await searchApi.percolate('test_pq', { query: { percolate: { document : { title : 'What a nice bag' } } } } );
{
"took": 0,
"timed_out": false,
"hits": {
"total": 1,
"hits": [
{
"_index": "test_pq",
"_type": "doc",
"_id": "1657852401006149661",
"_score": "1",
"_source": {
"query": {
"ql": "@title bag"
}
},
"fields": {
"_percolator_document_slot": [
1
]
}
}
]
}
}
query := map[string]interface{} {"title": "what a nice bag"}
percolateRequestQuery := manticoreclient.NewPercolateQuery(query)
percolateRequest := manticoreclient.NewPercolateRequest(percolateRequestQuery)
res, _, _ := apiClient.SearchAPI.Percolate(context.Background(), "test_pq").PercolateRequest(*percolateRequest).Execute()
{
"took": 0,
"timed_out": false,
"hits": {
"total": 1,
"hits": [
{
"_index": "test_pq",
"_type": "doc",
"_id": "1657852401006149661",
"_score": "1",
"_source": {
"query": {
"ql": "@title bag"
}
},
"fields": {
"_percolator_document_slot": [
1
]
}
}
]
}
}
Note that with CALL PQ, you can provide multiple documents in different ways:
- ('doc1', 'doc2'). This requires 0 as docs_json
- ('{doc1}', '{doc2}')
- '[{doc1}, {doc2}]'
SQL:
CALL PQ('products', ('nice pair of shoes', 'beautiful bag'), 1 as query, 0 as docs_json);
CALL PQ('products', ('{"title": "nice pair of shoes", "color": "red"}', '{"title": "beautiful bag"}'), 1 as query);
CALL PQ('products', '[{"title": "nice pair of shoes", "color": "blue"}, {"title": "beautiful bag"}]', 1 as query);
+---------------------+------------+------+---------+
| id | query | tags | filters |
+---------------------+------------+------+---------+
| 1657852401006149637 | @title bag | | |
+---------------------+------------+------+---------+
+---------------------+--------------+------+-------------+
| id | query | tags | filters |
+---------------------+--------------+------+-------------+
| 1657852401006149636 | @title shoes | | color='red' |
| 1657852401006149637 | @title bag | | |
+---------------------+--------------+------+-------------+
+---------------------+--------------+------+---------------------------+
| id | query | tags | filters |
+---------------------+--------------+------+---------------------------+
| 1657852401006149635 | @title shoes | | color IN ('blue', 'green') |
| 1657852401006149637 | @title bag | | |
+---------------------+--------------+------+---------------------------+
POST /pq/products/_search
{
"query": {
"percolate": {
"documents": [
{"title": "nice pair of shoes", "color": "blue"},
{"title": "beautiful bag"}
]
}
}
}
{
"took": 0,
"timed_out": false,
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "products",
"_type": "doc",
"_id": "1657852401006149644",
"_score": "1",
"_source": {
"query": {
"ql": "@title bag"
}
},
"fields": {
"_percolator_document_slot": [
2
]
}
},
{
"_index": "products",
"_type": "doc",
"_id": "1657852401006149646",
"_score": "1",
"_source": {
"query": {
"ql": "@title shoes"
}
},
"fields": {
"_percolator_document_slot": [
1
]
}
}
]
}
}
$percolate = [
'index' => 'products',
'body' => [
'query' => [
'percolate' => [
'documents' => [
['title' => 'nice pair of shoes','color'=>'blue'],
['title' => 'beautiful bag']
]
]
]
]
];
$client->pq()->search($percolate);
Array
(
[took] => 23
[timed_out] =>
[hits] => Array
(
[total] => 2
[max_score] => 1
[hits] => Array
(
[0] => Array
(
[_index] => products
[_type] => doc
[_id] => 2810781492890828819
[_score] => 1
[_source] => Array
(
[query] => Array
(
[match] => Array
(
[title] => bag
)
)
)
[fields] => Array
(
[_percolator_document_slot] => Array
(
[0] => 2
)
)
)
[1] => Array
(
[_index] => products
[_type] => doc
[_id] => 2810781492890828821
[_score] => 1
[_source] => Array
(
[query] => Array
(
[match] => Array
(
[title] => shoes
)
)
)
[fields] => Array
(
[_percolator_document_slot] => Array
(
[0] => 1
)
)
)
)
)
)
searchApi.percolate('products',{"query":{"percolate":{"documents":[{"title":"nice pair of shoes","color":"blue"},{"title":"beautiful bag"}]}}})
{'hits': {'hits': [{u'_id': u'2811025403043381494',
u'_index': u'products',
u'_score': u'1',
u'_source': {u'query': {u'ql': u'@title bag'}},
u'_type': u'doc',
u'fields': {u'_percolator_document_slot': [2]}},
{u'_id': u'2811025403043381496',
u'_index': u'products',
u'_score': u'1',
u'_source': {u'query': {u'ql': u'@title shoes'}},
u'_type': u'doc',
u'fields': {u'_percolator_document_slot': [1]}}],
'total': 2},
'profile': None,
'timed_out': False,
'took': 0}
res = await searchApi.percolate('products',{"query":{"percolate":{"documents":[{"title":"nice pair of shoes","color":"blue"},{"title":"beautiful bag"}]}}});
{
"took": 6,
"timed_out": false,
"hits": {
"total": 2,
"hits": [
{
"_index": "products",
"_type": "doc",
"_id": "2811045522851233808",
"_score": "1",
"_source": {
"query": {
"ql": "@title bag"
}
},
"fields": {
"_percolator_document_slot": [
2
]
}
},
{
"_index": "products",
"_type": "doc",
"_id": "2811045522851233810",
"_score": "1",
"_source": {
"query": {
"ql": "@title shoes"
}
},
"fields": {
"_percolator_document_slot": [
1
]
}
}
]
}
}
percolateRequest = new PercolateRequest();
query = new HashMap<String,Object>(){{
put("percolate",new HashMap<String,Object >(){{
put("documents", new ArrayList<Object>(){{
add(new HashMap<String,Object >(){{
put("title","nice pair of shoes");
put("color","blue");
}});
add(new HashMap<String,Object >(){{
put("title","beautiful bag");
}});
}});
}});
}};
percolateRequest.query(query);
searchApi.percolate("products",percolateRequest);
class SearchResponse {
took: 0
timedOut: false
hits: class SearchResponseHits {
total: 2
maxScore: 1
hits: [{_index=products, _type=doc, _id=2811045522851234133, _score=1, _source={query={ql=@title bag}}, fields={_percolator_document_slot=[2]}}, {_index=products, _type=doc, _id=2811045522851234135, _score=1, _source={query={ql=@title shoes}}, fields={_percolator_document_slot=[1]}}]
aggregations: null
}
profile: null
}
var doc1 = new Dictionary<string, Object>();
doc1.Add("title","nice pair of shoes");
doc1.Add("color","blue");
var doc2 = new Dictionary<string, Object>();
doc2.Add("title","beautiful bag");
var docs = new List<Object> {doc1, doc2};
Dictionary<string, Object> percolateDoc = new Dictionary<string, Object> {{ "documents", docs }};
Dictionary<string, Object> query = new Dictionary<string, Object> {{ "percolate", percolateDoc }};
PercolateRequest percolateRequest = new PercolateRequest(query: query);
searchApi.Percolate("products",percolateRequest);
class SearchResponse {
took: 0
timedOut: false
hits: class SearchResponseHits {
total: 2
maxScore: 1
hits: [{_index=products, _type=doc, _id=2811045522851234133, _score=1, _source={query={ql=@title bag}}, fields={_percolator_document_slot=[2]}}, {_index=products, _type=doc, _id=2811045522851234135, _score=1, _source={query={ql=@title shoes}}, fields={_percolator_document_slot=[1]}}]
aggregations: null
}
profile: null
}
docs = [ {title : 'What a nice bag'}, {title : 'Really nice shoes'} ];
res = await searchApi.percolate('test_pq', { query: { percolate: { documents : docs } } } );
{
"took": 0,
"timed_out": false,
"hits": {
"total": 2,
"hits": [
{
"_index": "test_pq",
"_type": "doc",
"_id": "1657852401006149661",
"_score": "1",
"_source": {
"query": {
"ql": "@title bag"
}
},
"fields": {
"_percolator_document_slot": [
1
]
}
},
{
"_index": "test_pq",
"_type": "doc",
"_id": "1657852401006149662",
"_score": "1",
"_source": {
"query": {
"ql": "@title shoes"
}
},
"fields": {
"_percolator_document_slot": [
1
]
}
}
]
}
}
doc1 := map[string]interface{} {"title": "What a nice bag"}
doc2 := map[string]interface{} {"title": "Really nice shoes"}
query := []interface{} {doc1, doc2}
percolateRequestQuery := manticoreclient.NewPercolateQuery(query)
percolateRequest := manticoreclient.NewPercolateRequest(percolateRequestQuery)
res, _, _ := apiClient.SearchAPI.Percolate(context.Background(), "test_pq").PercolateRequest(*percolateRequest).Execute()
{
"took": 0,
"timed_out": false,
"hits": {
"total": 2,
"hits": [
{
"_index": "test_pq",
"_type": "doc",
"_id": "1657852401006149661",
"_score": "1",
"_source": {
"query": {
"ql": "@title bag"
}
},
"fields": {
"_percolator_document_slot": [
1
]
}
},
{
"_index": "test_pq",
"_type": "doc",
"_id": "1657852401006149662",
"_score": "1",
"_source": {
"query": {
"ql": "@title shoes"
}
},
"fields": {
"_percolator_document_slot": [
1
]
}
}
]
}
}
Using the option 1 as docs allows you to see which documents of the provided ones match which rules.
SQL:
CALL PQ('products', '[{"title": "nice pair of shoes", "color": "blue"}, {"title": "beautiful bag"}]', 1 as query, 1 as docs);
+---------------------+-----------+--------------+------+---------------------------+
| id | documents | query | tags | filters |
+---------------------+-----------+--------------+------+---------------------------+
| 1657852401006149635 | 1 | @title shoes | | color IN ('blue', 'green') |
| 1657852401006149637 | 2 | @title bag | | |
+---------------------+-----------+--------------+------+---------------------------+
POST /pq/products/_search
{
"query": {
"percolate": {
"documents": [
{"title": "nice pair of shoes", "color": "blue"},
{"title": "beautiful bag"}
]
}
}
}
{
"took": 0,
"timed_out": false,
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "products",
"_type": "doc",
"_id": "1657852401006149644",
"_score": "1",
"_source": {
"query": {
"ql": "@title bag"
}
},
"fields": {
"_percolator_document_slot": [
2
]
}
},
{
"_index": "products",
"_type": "doc",
"_id": "1657852401006149646",
"_score": "1",
"_source": {
"query": {
"ql": "@title shoes"
}
},
"fields": {
"_percolator_document_slot": [
1
]
}
}
]
}
}
$percolate = [
'index' => 'products',
'body' => [
'query' => [
'percolate' => [
'documents' => [
['title' => 'nice pair of shoes','color'=>'blue'],
['title' => 'beautiful bag']
]
]
]
]
];
$client->pq()->search($percolate);
Array
(
[took] => 23
[timed_out] =>
[hits] => Array
(
[total] => 2
[max_score] => 1
[hits] => Array
(
[0] => Array
(
[_index] => products
[_type] => doc
[_id] => 2810781492890828819
[_score] => 1
[_source] => Array
(
[query] => Array
(
[match] => Array
(
[title] => bag
)
)
)
[fields] => Array
(
[_percolator_document_slot] => Array
(
[0] => 2
)
)
)
[1] => Array
(
[_index] => products
[_type] => doc
[_id] => 2810781492890828821
[_score] => 1
[_source] => Array
(
[query] => Array
(
[match] => Array
(
[title] => shoes
)
)
)
[fields] => Array
(
[_percolator_document_slot] => Array
(
[0] => 1
)
)
)
)
)
)
searchApi.percolate('products',{"query":{"percolate":{"documents":[{"title":"nice pair of shoes","color":"blue"},{"title":"beautiful bag"}]}}})
{'hits': {'hits': [{u'_id': u'2811025403043381494',
u'_index': u'products',
u'_score': u'1',
u'_source': {u'query': {u'ql': u'@title bag'}},
u'_type': u'doc',
u'fields': {u'_percolator_document_slot': [2]}},
{u'_id': u'2811025403043381496',
u'_index': u'products',
u'_score': u'1',
u'_source': {u'query': {u'ql': u'@title shoes'}},
u'_type': u'doc',
u'fields': {u'_percolator_document_slot': [1]}}],
'total': 2},
'profile': None,
'timed_out': False,
'took': 0}
res = await searchApi.percolate('products',{"query":{"percolate":{"documents":[{"title":"nice pair of shoes","color":"blue"},{"title":"beautiful bag"}]}}});
{
"took": 6,
"timed_out": false,
"hits": {
"total": 2,
"hits": [
{
"_index": "products",
"_type": "doc",
"_id": "2811045522851233808",
"_score": "1",
"_source": {
"query": {
"ql": "@title bag"
}
},
"fields": {
"_percolator_document_slot": [
2
]
}
},
{
"_index": "products",
"_type": "doc",
"_id": "2811045522851233810",
"_score": "1",
"_source": {
"query": {
"ql": "@title shoes"
}
},
"fields": {
"_percolator_document_slot": [
1
]
}
}
]
}
}
percolateRequest = new PercolateRequest();
query = new HashMap<String,Object>(){{
put("percolate",new HashMap<String,Object >(){{
put("documents", new ArrayList<Object>(){{
add(new HashMap<String,Object >(){{
put("title","nice pair of shoes");
put("color","blue");
}});
add(new HashMap<String,Object >(){{
put("title","beautiful bag");
}});
}});
}});
}};
percolateRequest.query(query);
searchApi.percolate("products",percolateRequest);
class SearchResponse {
took: 0
timedOut: false
hits: class SearchResponseHits {
total: 2
maxScore: 1
hits: [{_index=products, _type=doc, _id=2811045522851234133, _score=1, _source={query={ql=@title bag}}, fields={_percolator_document_slot=[2]}}, {_index=products, _type=doc, _id=2811045522851234135, _score=1, _source={query={ql=@title shoes}}, fields={_percolator_document_slot=[1]}}]
aggregations: null
}
profile: null
}
var doc1 = new Dictionary<string, Object>();
doc1.Add("title","nice pair of shoes");
doc1.Add("color","blue");
var doc2 = new Dictionary<string, Object>();
doc2.Add("title","beautiful bag");
var docs = new List<Object> {doc1, doc2};
Dictionary<string, Object> percolateDoc = new Dictionary<string, Object> {{ "documents", docs }};
Dictionary<string, Object> query = new Dictionary<string, Object> {{ "percolate", percolateDoc }};
PercolateRequest percolateRequest = new PercolateRequest(query: query);
searchApi.Percolate("products",percolateRequest);
class SearchResponse {
took: 0
timedOut: false
hits: class SearchResponseHits {
total: 2
maxScore: 1
hits: [{_index=products, _type=doc, _id=2811045522851234133, _score=1, _source={query={ql=@title bag}}, fields={_percolator_document_slot=[2]}}, {_index=products, _type=doc, _id=2811045522851234135, _score=1, _source={query={ql=@title shoes}}, fields={_percolator_document_slot=[1]}}]
aggregations: null
}
profile: null
}
docs = [ {title : 'What a nice bag'}, {title : 'Really nice shoes'} ];
res = await searchApi.percolate('test_pq', { query: { percolate: { documents : docs } } } );
{
"took": 0,
"timed_out": false,
"hits": {
"total": 2,
"hits": [
{
"_index": "test_pq",
"_type": "doc",
"_id": "1657852401006149661",
"_score": "1",
"_source": {
"query": {
"ql": "@title bag"
}
},
"fields": {
"_percolator_document_slot": [
1
]
}
},
{
"_index": "test_pq",
"_type": "doc",
"_id": "1657852401006149662",
"_score": "1",
"_source": {
"query": {
"ql": "@title shoes"
}
},
"fields": {
"_percolator_document_slot": [
1
]
}
}
]
}
}
doc1 := map[string]interface{} {"title": "What a nice bag"}
doc2 := map[string]interface{} {"title": "Really nice shoes"}
query := []interface{} {doc1, doc2}
percolateRequestQuery := manticoreclient.NewPercolateQuery(query)
percolateRequest := manticoreclient.NewPercolateRequest(percolateRequestQuery)
res, _, _ := apiClient.SearchAPI.Percolate(context.Background(), "test_pq").PercolateRequest(*percolateRequest).Execute()
{
"took": 0,
"timed_out": false,
"hits": {
"total": 2,
"hits": [
{
"_index": "test_pq",
"_type": "doc",
"_id": "1657852401006149661",
"_score": "1",
"_source": {
"query": {
"ql": "@title bag"
}
},
"fields": {
"_percolator_document_slot": [
1
]
}
},
{
"_index": "test_pq",
"_type": "doc",
"_id": "1657852401006149662",
"_score": "1",
"_source": {
"query": {
"ql": "@title shoes"
}
},
"fields": {
"_percolator_document_slot": [
1
]
}
}
]
}
}
By default, matching document ids correspond to their relative numbers in the list you provide. However, in some cases, each document already has its own id. For this case, there's an option 'id field name' as docs_id for CALL PQ.
Note that if the id cannot be found by the provided field name, the PQ rule will not be shown in the results.
This option is only available for CALL PQ via SQL.
CALL PQ('products', '[{"id": 123, "title": "nice pair of shoes", "color": "blue"}, {"id": 456, "title": "beautiful bag"}]', 1 as query, 'id' as docs_id, 1 as docs);
+---------------------+-----------+--------------+------+---------------------------+
| id | documents | query | tags | filters |
+---------------------+-----------+--------------+------+---------------------------+
| 1657852401006149664 | 456 | @title bag | | |
| 1657852401006149666 | 123 | @title shoes | | color IN ('blue', 'green') |
+---------------------+-----------+--------------+------+---------------------------+
When using CALL PQ with separate JSON strings, you can use the option 1 as skip_bad_json to skip any invalid JSON in the input. In the example below, the second query fails because of invalid JSON, but the third avoids the error by using 1 as skip_bad_json. Keep in mind that this option is not available when sending JSON queries over HTTP, as the whole JSON query must be valid in that case.
SQL:
CALL PQ('products', ('{"title": "nice pair of shoes", "color": "blue"}', '{"title": "beautiful bag"}'));
CALL PQ('products', ('{"title": "nice pair of shoes", "color": "blue"}', '{"title": "beautiful bag}'));
CALL PQ('products', ('{"title": "nice pair of shoes", "color": "blue"}', '{"title": "beautiful bag}'), 1 as skip_bad_json);
+---------------------+
| id |
+---------------------+
| 1657852401006149635 |
| 1657852401006149637 |
+---------------------+
ERROR 1064 (42000): Bad JSON objects in strings: 2
+---------------------+
| id |
+---------------------+
| 1657852401006149635 |
+---------------------+
Percolate queries are designed with high throughput and large data volumes in mind. To optimize performance for lower latency and higher throughput, consider the following.
There are two distribution modes for a percolate table and for how a percolate query can run against it: sparsed (the default) and sharded (selected with the mode option shown in the table above).
Assume you have table pq_d2 defined as:
table pq_d2
{
type = distributed
agent = 127.0.0.1:6712:pq
agent = 127.0.0.1:6712:ptitle
}
Each of 'pq' and 'ptitle' contains:
SELECT * FROM pq;
+------+-------------+------+-------------------+
| id | query | tags | filters |
+------+-------------+------+-------------------+
| 1 | filter test | | gid>=10 |
| 2 | angry | | gid>=10 OR gid<=3 |
+------+-------------+------+-------------------+
2 rows in set (0.01 sec)
POST /pq/pq/_search
{
"took":0,
"timed_out":false,
"hits":{
"total":2,
"hits":[
{
"_id":"1",
"_score":1,
"_source":{
"query":{ "ql":"filter test" },
"tags":"",
"filters":"gid>=10"
}
},
{
"_id":"2",
"_score":1,
"_source":{
"query":{"ql":"angry"},
"tags":"",
"filters":"gid>=10 OR gid<=3"
}
}
]
}
}
$params = [
'index' => 'pq',
'body' => [
]
];
$response = $client->pq()->search($params);
(
[took] => 0
[timed_out] =>
[hits] =>
(
[total] => 2
[hits] =>
(
[0] =>
(
[_id] => 1
[_score] => 1
[_source] =>
(
[query] =>
(
[ql] => filter test
)
[tags] =>
[filters] => gid>=10
)
),
[1] =>
(
[_id] => 2
[_score] => 1
[_source] =>
(
[query] =>
(
[ql] => angry
)
[tags] =>
[filters] => gid>=10 OR gid<=3
)
)
)
)
)
searchApi.search({"index":"pq","query":{"match_all":{}}})
{'hits': {'hits': [{u'_id': u'2811025403043381501',
u'_score': 1,
u'_source': {u'filters': u"gid>=10",
u'query': u'filter test',
u'tags': u''}},
{u'_id': u'2811025403043381502',
u'_score': 1,
u'_source': {u'filters': u"gid>=10 OR gid<=3",
u'query': u'angry',
u'tags': u''}}],
'total': 2},
'profile': None,
'timed_out': False,
'took': 0}
res = await searchApi.search({"index":"pq","query":{"match_all":{}}});
{'hits': {'hits': [{u'_id': u'2811025403043381501',
u'_score': 1,
u'_source': {u'filters': u"gid>=10",
u'query': u'filter test',
u'tags': u''}},
{u'_id': u'2811025403043381502',
u'_score': 1,
u'_source': {u'filters': u"gid>=10 OR gid<=3",
u'query': u'angry',
u'tags': u''}}],
'total': 2},
'profile': None,
'timed_out': False,
'took': 0}
res = await searchApi.search({"index":"pq","query":{"match_all":{}}});
{"hits": {"hits": [{"_id": "2811025403043381501",
"_score": 1,
"_source": {"filters": u"gid>=10",
"query": "filter test",
"tags": ""}},
{"_id": "2811025403043381502",
"_score": 1,
"_source": {"filters": u"gid>=10 OR gid<=3",
"query": "angry",
"tags": ""}}],
"total": 2},
"timed_out": false,
"took": 0}
Map<String,Object> query = new HashMap<String,Object>();
query.put("match_all",null);
SearchRequest searchRequest = new SearchRequest();
searchRequest.setIndex("pq");
searchRequest.setQuery(query);
SearchResponse searchResponse = searchApi.search(searchRequest);
class SearchResponse {
took: 0
timedOut: false
hits: class SearchResponseHits {
total: 2
maxScore: null
hits: [{_id=2811045522851233962, _score=1, _source={filters=gid>=10, query=filter test, tags=}}, {_id=2811045522851233951, _score=1, _source={filters=gid>=10 OR gid<=3, query=angry,tags=}}]
aggregations: null
}
profile: null
}
object query = new { match_all=null };
SearchRequest searchRequest = new SearchRequest("pq", query);
SearchResponse searchResponse = searchApi.Search(searchRequest);
class SearchResponse {
took: 0
timedOut: false
hits: class SearchResponseHits {
total: 2
maxScore: null
hits: [{_id=2811045522851233962, _score=1, _source={filters=gid>=10, query=filter test, tags=}}, {_id=2811045522851233951, _score=1, _source={filters=gid>=10 OR gid<=3, query=angry,tags=}}]
aggregations: null
}
profile: null
}
res = await searchApi.search({"index":"test_pq","query":{"match_all":{}}});
{
'hits':
{
'hits':
[{
'_id': '2811025403043381501',
'_score': 1,
'_source':
{
'filters': "gid>=10",
'query': 'filter test',
'tags': ''
}
},
{
'_id':
'2811025403043381502',
'_score': 1,
'_source':
{
'filters': "gid>=10 OR gid<=3",
'query': 'angry',
'tags': ''
}
}],
'total': 2
},
'profile': None,
'timed_out': False,
'took': 0
}
query := map[string]interface{} {}
percolateRequestQuery := manticoreclient.NewPercolateRequestQuery(query)
percolateRequest := manticoreclient.NewPercolateRequest(percolateRequestQuery)
res, _, _ := apiClient.SearchAPI.Percolate(context.Background(), "test_pq").PercolateRequest(*percolateRequest).Execute()
{
'hits':
{
'hits':
[{
'_id': '2811025403043381501',
'_score': 1,
'_source':
{
'filters': "gid>=10",
'query': 'filter test',
'tags': ''
}
},
{
'_id':
'2811025403043381502',
'_score': 1,
'_source':
{
'filters': "gid>=10 OR gid<=3",
'query': 'angry',
'tags': ''
}
}],
'total': 2
},
'profile': None,
'timed_out': False,
'took': 0
}
And you execute CALL PQ on the distributed table with a couple of documents.
CALL PQ ('pq_d2', ('{"title":"angry test", "gid":3 }', '{"title":"filter test doc2", "gid":13}'), 1 AS docs);
+------+-----------+
| id | documents |
+------+-----------+
| 1 | 2 |
| 2 | 1 |
+------+-----------+
POST /pq/pq/_search -d '
{
"query":
{
"percolate":
{
"documents" : [
{ "title": "angry test", "gid": 3 },
{ "title": "filter test doc2", "gid": 13 }
]
}
}
}
'
{
"took":0,
"timed_out":false,
"hits":{
"total":2,"hits":[
{
"_id":"2",
"_score":1,
"_source":{
"query":{"title":"angry"},
"tags":"",
"filters":"gid>=10 OR gid<=3"
}
},
{
"_id":"1",
"_score":1,
"_source":{
"query":{"ql":"filter test"},
"tags":"",
"filters":"gid>=10"
}
}
]
}
}
$params = [
'index' => 'pq',
'body' => [
'query' => [
'percolate' => [
'documents' => [
[
'title'=>'angry test',
'gid' => 3
],
[
'title'=>'filter test doc2',
'gid' => 13
],
]
]
]
]
];
$response = $client->pq()->search($params);
(
[took] => 0
[timed_out] =>
[hits] =>
(
[total] => 2
[hits] =>
(
[0] =>
(
[_index] => pq
[_type] => doc
[_id] => 2
[_score] => 1
[_source] =>
(
[query] =>
(
[ql] => angry
)
[tags] =>
[filters] => gid>=10 OR gid<=3
),
[fields] =>
(
[_percolator_document_slot] =>
(
[0] => 1
)
)
),
[1] =>
(
[_index] => pq
[_id] => 1
[_score] => 1
[_source] =>
(
[query] =>
(
[ql] => filter test
)
[tags] =>
[filters] => gid>=10
)
[fields] =>
(
[_percolator_document_slot] =>
(
[0] => 0
)
)
)
)
)
)
searchApi.percolate('pq',{"percolate":{"documents":[{"title":"angry test","gid":3},{"title":"filter test doc2","gid":13}]}})
{'hits': {'hits': [{u'_id': u'2811025403043381480',
u'_index': u'pq',
u'_score': u'1',
u'_source': {u'query': {u'ql': u'angry'},u'tags':u'',u'filters':u"gid>=10 OR gid<=3"},
u'_type': u'doc',
u'fields': {u'_percolator_document_slot': [1]}},
{u'_id': u'2811025403043381501',
u'_index': u'pq',
u'_score': u'1',
u'_source': {u'query': {u'ql': u'filter test'},u'tags':u'',u'filters':u"gid>=10"},
u'_type': u'doc',
u'fields': {u'_percolator_document_slot': [1]}}],
'total': 2},
'profile': None,
'timed_out': False,
'took': 0}
res = await searchApi.percolate('pq',{"percolate":{"documents":[{"title":"angry test","gid":3},{"title":"filter test doc2","gid":13}]}});
{'hits': {'hits': [{u'_id': u'2811025403043381480',
u'_index': u'pq',
u'_score': u'1',
u'_source': {u'query': {u'ql': u'angry'},u'tags':u'',u'filters':u"gid>=10 OR gid<=3"},
u'_type': u'doc',
u'fields': {u'_percolator_document_slot': [1]}},
{u'_id': u'2811025403043381501',
u'_index': u'pq',
u'_score': u'1',
u'_source': {u'query': {u'ql': u'filter test'},u'tags':u'',u'filters':u"gid>=10"},
u'_type': u'doc',
u'fields': {u'_percolator_document_slot': [1]}}],
'total': 2},
'profile': None,
'timed_out': False,
'took': 0}
percolateRequest = new PercolateRequest();
query = new HashMap<String,Object>(){{
put("percolate",new HashMap<String,Object >(){{
put("documents", new ArrayList<Object>(){{
add(new HashMap<String,Object >(){{
put("title","angry test");
put("gid",3);
}});
add(new HashMap<String,Object >(){{
put("title","filter test doc2");
put("gid",13);
}});
}});
}});
}};
percolateRequest.query(query);
searchApi.percolate("pq",percolateRequest);
class SearchResponse {
took: 10
timedOut: false
hits: class SearchResponseHits {
total: 2
maxScore: 1
hits: [{_index=pq, _type=doc, _id=2811045522851234165, _score=1, _source={query={ql=@title angry}}, fields={_percolator_document_slot=[1]}}, {_index=pq, _type=doc, _id=2811045522851234166, _score=1, _source={query={ql=@title filter test doc2}}, fields={_percolator_document_slot=[2]}}]
aggregations: null
}
profile: null
}
var doc1 = new Dictionary<string, Object>();
doc1.Add("title","angry test");
doc1.Add("gid",3);
var doc2 = new Dictionary<string, Object>();
doc2.Add("title","filter test doc2");
doc2.Add("gid",13);
var docs = new List<Object> {doc1, doc2};
Dictionary<string, Object> percolateDoc = new Dictionary<string, Object> {{ "documents", docs }};
Dictionary<string, Object> query = new Dictionary<string, Object> {{ "percolate", percolateDoc }};
PercolateRequest percolateRequest = new PercolateRequest(query=query);
searchApi.Percolate("pq",percolateRequest);
class SearchResponse {
took: 10
timedOut: false
hits: class SearchResponseHits {
total: 2
maxScore: 1
hits: [{_index=pq, _type=doc, _id=2811045522851234165, _score=1, _source={query={ql=@title angry}}, fields={_percolator_document_slot=[1]}}, {_index=pq, _type=doc, _id=2811045522851234166, _score=1, _source={query={ql=@title filter test doc2}}, fields={_percolator_document_slot=[2]}}]
aggregations: null
}
profile: null
}
docs = [ {title : 'What a nice bag'}, {title : 'Really nice shoes'} ];
res = await searchApi.percolate('test_pq', { query: { percolate: { documents : docs } } } );
{
"took": 0,
"timed_out": false,
"hits": {
"total": 2,
"hits": [
{
"_index": "test_pq",
"_type": "doc",
"_id": "1657852401006149661",
"_score": "1",
"_source": {
"query": {
"ql": "@title bag"
}
},
"fields": {
"_percolator_document_slot": [
1
]
}
},
{
"_index": "test_pq",
"_type": "doc",
"_id": "1657852401006149662",
"_score": "1",
"_source": {
"query": {
"ql": "@title shoes"
}
},
"fields": {
"_percolator_document_slot": [
1
]
}
}
]
}
}
doc1 := map[string]interface{} {"title": "What a nice bag"}
doc2 := map[string]interface{} {"title": "Really nice shoes"}
query := []interface{} {doc1, doc2}
percolateRequestQuery := manticoreclient.NewPercolateQuery(query)
percolateRequest := manticoreclient.NewPercolateRequest(percolateRequestQuery)
res, _, _ := apiClient.SearchAPI.Percolate(context.Background(), "test_pq").PercolateRequest(*percolateRequest).Execute()
{
"took": 0,
"timed_out": false,
"hits": {
"total": 2,
"hits": [
{
"_index": "test_pq",
"_type": "doc",
"_id": "1657852401006149661",
"_score": "1",
"_source": {
"query": {
"ql": "@title bag"
}
},
"fields": {
"_percolator_document_slot": [
1
]
}
},
{
"_index": "test_pq",
"_type": "doc",
"_id": "1657852401006149662",
"_score": "1",
"_source": {
"query": {
"ql": "@title shoes"
}
},
"fields": {
"_percolator_document_slot": [
1
]
}
}
]
}
}
In the previous example, we used the default sparse mode. To demonstrate the sharded mode, let's create a distributed PQ table consisting of 2 local PQ tables and add 2 documents to "products1" and 1 document to "products2":
create table products1(title text, color string) type='pq';
create table products2(title text, color string) type='pq';
create table products_distributed type='distributed' local='products1' local='products2';
INSERT INTO products1(query) values('@title bag');
INSERT INTO products1(query,filters) values('@title shoes', 'color=\'red\'');
INSERT INTO products2(query,filters) values('@title shoes', 'color in (\'blue\', \'green\')');
Now, if you add 'sharded' as mode to CALL PQ, it will send the documents to all the agents' tables (in this case, just local tables, but they can be remote to utilize external hardware). This mode is not available via the JSON interface.
SQL:
CALL PQ('products_distributed', ('{"title": "nice pair of shoes", "color": "blue"}', '{"title": "beautiful bag"}'), 'sharded' as mode, 1 as query);
+---------------------+--------------+------+---------------------------+
| id | query | tags | filters |
+---------------------+--------------+------+---------------------------+
| 1657852401006149639 | @title bag | | |
| 1657852401006149643 | @title shoes | | color IN ('blue', 'green') |
+---------------------+--------------+------+---------------------------+
Note that the syntax of agent mirrors in the configuration (when several hosts are assigned to one agent line, separated with |) has nothing to do with the CALL PQ query mode. Each agent always represents one node, regardless of the number of HA mirrors specified for that agent.
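To illustrate, here is a sketch of such a configuration (host names, ports, and table names are assumptions for illustration only): each agent line lists two mirrors separated by |, yet each line still counts as a single node for CALL PQ:
table pq_d2_ha
{
    type = distributed
    # two mirrors of the same percolate table; one node from CALL PQ's point of view
    agent = 127.0.0.1:6712|127.0.0.1:6713:pq
    # another node, also backed by two mirrors
    agent = 127.0.0.1:6714|127.0.0.1:6715:ptitle
}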
In some cases, you might want to get more details about the performance of a percolate query. For that purpose, there is the option 1 as verbose, which is only available via SQL and allows you to save more performance metrics. You can see them using the SHOW META query, which you can run after CALL PQ. See SHOW META for more info.
1 as verbose:
CALL PQ('products', ('{"title": "nice pair of shoes", "color": "blue"}', '{"title": "beautiful bag"}'), 1 as verbose); show meta;
+---------------------+
| id |
+---------------------+
| 1657852401006149644 |
| 1657852401006149646 |
+---------------------+
+-------------------------+-----------+
| Name | Value |
+-------------------------+-----------+
| Total | 0.000 sec |
| Setup | 0.000 sec |
| Queries matched | 2 |
| Queries failed | 0 |
| Document matched | 2 |
| Total queries stored | 3 |
| Term only queries | 3 |
| Fast rejected queries | 0 |
| Time per query | 27, 10 |
| Time of matched queries | 37 |
+-------------------------+-----------+
CALL PQ('products', ('{"title": "nice pair of shoes", "color": "blue"}', '{"title": "beautiful bag"}'), 0 as verbose); show meta;
+---------------------+
| id |
+---------------------+
| 1657852401006149644 |
| 1657852401006149646 |
+---------------------+
+-----------------------+-----------+
| Name | Value |
+-----------------------+-----------+
| Total | 0.000 sec |
| Queries matched | 2 |
| Queries failed | 0 |
| Document matched | 2 |
| Total queries stored | 3 |
| Term only queries | 3 |
| Fast rejected queries | 0 |
+-----------------------+-----------+
Autocomplete (or word completion) is a feature in which an application predicts the rest of a word a user is typing. On websites, it's used in search boxes, where a user starts to type a word, and a dropdown with suggestions pops up so the user can select the ending from the list.

There are a few ways you can do autocomplete in Manticore:
To autocomplete a sentence, you can use infixed search. You can find endings of a document's field by providing its beginning and:
* * to match anything it substitutes
* ^ to start from the beginning of the field
* "" for phrase matching
There is an article about it in our blog and an interactive course. A quick example is:
* you have a document: My cat loves my dog. The cat (Felis catus) is a domestic species of small carnivorous mammal.
* you combine ^, "", and * so that, as the user is typing, you make queries like: ^"m*", ^"my *", ^"my c*", ^"my ca*" and so on
* you then display the matched beginning of the field, e.g. <b>My cat</b> loves my dog. The cat ( ... (see the query sketch below)
In some cases, all you need is to autocomplete a single word or a couple of words. In this case, you can use CALL KEYWORDS.
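Below is a minimal sketch of the sentence-autocomplete approach described above; the table name articles is an assumption, and the query follows the ^"..." pattern shown earlier:
create table articles(content text) min_infix_len='2';
insert into articles(content) values('My cat loves my dog. The cat (Felis catus) is a domestic species of small carnivorous mammal.');
-- as the user types "my ca", query with a field-start anchor, phrase matching and a wildcard:
select id, highlight() from articles where match('^"my ca*"');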
CALL KEYWORDS is available through the SQL interface and offers a way to examine how keywords are tokenized or to obtain the tokenized forms of specific keywords. If the table enables infixes, it allows you to quickly find possible endings for given keywords, making it suitable for autocomplete functionality.
This is a great alternative to general infixed search, as it provides higher performance since it only needs the table's dictionary, not the documents themselves.
CALL KEYWORDS(text, table [, options])
The CALL KEYWORDS statement divides text into keywords. It returns the tokenized and normalized forms of the keywords, and if desired, keyword statistics. Additionally, it provides the position of each keyword in the query and all forms of tokenized keywords when the table enables lemmatizers.
| Parameter | Description |
|---|---|
| text | Text to break down to keywords |
| table | Name of the table from which to take the text processing settings |
| 0/1 as stats | Show statistics of keywords, default is 0 |
| 0/1 as fold_wildcards | Fold wildcards, default is 0 |
| 0/1 as fold_lemmas | Fold morphological lemmas, default is 0 |
| 0/1 as fold_blended | Fold blended words, default is 0 |
| N as expansion_limit | Override expansion_limit defined in the server configuration, default is 0 (use value from the configuration) |
| docs/hits as sort_mode | Sort output results by either 'docs' or 'hits'. Default no sorting |
The examples below show how it works, assuming the user is trying to get an autocomplete for "my cat ...". On the application side, all you need to do is suggest to the user the endings from the "normalized" column for each new word. It often makes sense to sort by hits or docs using 'hits' as sort_mode or 'docs' as sort_mode.
MySQL [(none)]> CALL KEYWORDS('m*', 't', 1 as stats);
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1 | m* | my | 1 | 2 |
| 1 | m* | mammal | 1 | 1 |
+------+-----------+------------+------+------+
MySQL [(none)]> CALL KEYWORDS('my*', 't', 1 as stats);
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1 | my* | my | 1 | 2 |
+------+-----------+------------+------+------+
MySQL [(none)]> CALL KEYWORDS('c*', 't', 1 as stats, 'hits' as sort_mode);
+------+-----------+-------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+-------------+------+------+
| 1 | c* | cat | 1 | 2 |
| 1 | c* | carnivorous | 1 | 1 |
| 1 | c* | catus | 1 | 1 |
+------+-----------+-------------+------+------+
MySQL [(none)]> CALL KEYWORDS('ca*', 't', 1 as stats, 'hits' as sort_mode);
+------+-----------+-------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+-------------+------+------+
| 1 | ca* | cat | 1 | 2 |
| 1 | ca* | carnivorous | 1 | 1 |
| 1 | ca* | catus | 1 | 1 |
+------+-----------+-------------+------+------+
MySQL [(none)]> CALL KEYWORDS('cat*', 't', 1 as stats, 'hits' as sort_mode);
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1 | cat* | cat | 1 | 2 |
| 1 | cat* | catus | 1 | 1 |
+------+-----------+------------+------+------+
There is a nice trick to improve the above algorithm: use bigram_index. When it is enabled for the table, Manticore indexes not only single words, but also each pair of adjacent words as a separate token.
This allows predicting not just the ending of the current word, but the next word too, which is especially beneficial for autocomplete.
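For instance, the table used in the examples below could be created like this (a sketch; the min_infix_len value is an assumption), so that each pair of adjacent words is indexed as a separate token:
create table t(content text) min_infix_len='2' bigram_index='all';
insert into t(content) values('My cat loves my dog. The cat (Felis catus) is a domestic species of small carnivorous mammal.');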
MySQL [(none)]> CALL KEYWORDS('m*', 't', 1 as stats, 'hits' as sort_mode);
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1 | m* | my | 1 | 2 |
| 1 | m* | mammal | 1 | 1 |
| 1 | m* | my cat | 1 | 1 |
| 1 | m* | my dog | 1 | 1 |
+------+-----------+------------+------+------+
MySQL [(none)]> CALL KEYWORDS('my*', 't', 1 as stats, 'hits' as sort_mode);
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1 | my* | my | 1 | 2 |
| 1 | my* | my cat | 1 | 1 |
| 1 | my* | my dog | 1 | 1 |
+------+-----------+------------+------+------+
MySQL [(none)]> CALL KEYWORDS('c*', 't', 1 as stats, 'hits' as sort_mode);
+------+-----------+--------------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+--------------------+------+------+
| 1 | c* | cat | 1 | 2 |
| 1 | c* | carnivorous | 1 | 1 |
| 1 | c* | carnivorous mammal | 1 | 1 |
| 1 | c* | cat felis | 1 | 1 |
| 1 | c* | cat loves | 1 | 1 |
| 1 | c* | catus | 1 | 1 |
| 1 | c* | catus is | 1 | 1 |
+------+-----------+--------------------+------+------+
MySQL [(none)]> CALL KEYWORDS('ca*', 't', 1 as stats, 'hits' as sort_mode);
+------+-----------+--------------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+--------------------+------+------+
| 1 | ca* | cat | 1 | 2 |
| 1 | ca* | carnivorous | 1 | 1 |
| 1 | ca* | carnivorous mammal | 1 | 1 |
| 1 | ca* | cat felis | 1 | 1 |
| 1 | ca* | cat loves | 1 | 1 |
| 1 | ca* | catus | 1 | 1 |
| 1 | ca* | catus is | 1 | 1 |
+------+-----------+--------------------+------+------+
MySQL [(none)]> CALL KEYWORDS('cat*', 't', 1 as stats, 'hits' as sort_mode);
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1 | cat* | cat | 1 | 2 |
| 1 | cat* | cat felis | 1 | 1 |
| 1 | cat* | cat loves | 1 | 1 |
| 1 | cat* | catus | 1 | 1 |
| 1 | cat* | catus is | 1 | 1 |
+------+-----------+------------+------+------+
CALL KEYWORDS supports distributed tables, so no matter how big your dataset is, you can benefit from using it.
Spell correction, also known as auto correction, text correction, typo tolerance, "did you mean?" and so on, is a software functionality that suggests alternatives to or makes automatic corrections of the text you have typed in. The concept of correcting typed text dates back to the 1960s, when computer scientist Warren Teitelman, who also invented the "undo" command, introduced a philosophy of computing called D.W.I.M., or "Do What I Mean." Instead of programming computers to accept only perfectly formatted instructions, Teitelman argued that they should be programmed to recognize obvious mistakes.
The first well-known product to provide spell correction functionality was Microsoft Word 6.0, released in 1993.
There are a few ways spell correction can be done, but it's important to note that there is no purely programmatic way to convert your mistyped "ipone" into "iphone" with decent quality. Mostly, there has to be a dataset the system is based on, for example a dictionary built from your own data or one taken from an external source.
Manticore provides the commands CALL QSUGGEST and CALL SUGGEST that can be used for automatic spell correction purposes.
Both commands are available via SQL only, and the general syntax is:
CALL QSUGGEST(word, table [,options])
CALL SUGGEST(word, table [,options])
options: N as option_name[, M as another_option, ...]
These commands provide all suggestions from the dictionary for a given word. They work only on tables with infixing enabled and dict=keywords. They return the suggested keywords, Levenshtein distance between the suggested and original keywords, and the document statistics of the suggested keyword.
If the first parameter contains multiple words, then:
* CALL QSUGGEST will return suggestions only for the last word, ignoring the rest.
* CALL SUGGEST will return suggestions only for the first word.
That's the only difference between them. Several options are supported for customization:
| Option | Description | Default |
|---|---|---|
| limit | Returns N top matches | 5 |
| max_edits | Keeps only dictionary words with a Levenshtein distance less than or equal to N | 4 |
| result_stats | Provides Levenshtein distance and document count of the found words | 1 (enabled) |
| delta_len | Keeps only dictionary words with a length difference less than N | 3 |
| max_matches | Number of matches to keep | 25 |
| reject | Rejected words are matches that are not better than those already in the match queue. They are put in a rejected queue that gets reset in case one actually can go in the match queue. This parameter defines the size of the rejected queue (as reject*max(max_matched,limit)). If the rejected queue is filled, the engine stops looking for potential matches | 4 |
| result_line | Alternate mode to display the data: all suggestions are returned in one row, all distances in another, and all docs in another | 0 |
| non_char | Do not skip dictionary words with non-alphabet symbols | 0 (skip such words) |
| sentence | Returns the original sentence along with the last word replaced by the matched one. | 0 (do not return the full sentence) |
To show how it works, let's create a table and add a few documents to it.
create table products(title text) min_infix_len='2';
insert into products values (0,'Crossbody Bag with Tassel'), (0,'microfiber sheet set'), (0,'Pet Hair Remover Glove');
As you can see, the mistyped word "crossbUdy" gets corrected to "crossbody". By default, CALL SUGGEST/QSUGGEST return:
* distance - the Levenshtein distance, which means how many edits they had to make to convert the given word to the suggestion
* docs - number of documents containing the suggested word
To disable the display of these statistics, you can use the option 0 as result_stats.
call suggest('crossbudy', 'products');
+-----------+----------+------+
| suggest | distance | docs |
+-----------+----------+------+
| crossbody | 1 | 1 |
+-----------+----------+------+
If the first parameter is not a single word, but multiple, then CALL SUGGEST will return suggestions only for the first word.
call suggest('bagg with tasel', 'products');
+---------+----------+------+
| suggest | distance | docs |
+---------+----------+------+
| bag | 1 | 1 |
+---------+----------+------+
If the first parameter is not a single word, but multiple, then CALL QSUGGEST will return suggestions only for the last word.
CALL QSUGGEST('bagg with tasel', 'products');
+---------+----------+------+
| suggest | distance | docs |
+---------+----------+------+
| tassel | 1 | 1 |
+---------+----------+------+
Adding 1 as sentence makes CALL QSUGGEST return the entire sentence with the last word corrected.
CALL QSUGGEST('bag with tasel', 'products', 1 as sentence);
+-------------------+----------+------+
| suggest | distance | docs |
+-------------------+----------+------+
| bag with tassel | 1 | 1 |
+-------------------+----------+------+
The 1 as result_line option changes the way the suggestions are displayed in the output. Instead of showing each suggestion in a separate row, it displays all suggestions, distances, and docs in a single row. Here's an example to demonstrate this:
CALL QSUGGEST('bagg with tasel', 'products', 1 as result_line);
+----------+--------+
| name | value |
+----------+--------+
| suggests | tassel |
| distance | 1 |
| docs | 1 |
+----------+--------+
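The remaining options from the table above can be combined in the same call. For example, the following sketch asks for at most 3 suggestions within 2 edits and hides the statistics (the exact output depends on your data):
CALL QSUGGEST('crossbudy', 'products', 3 as limit, 2 as max_edits, 0 as result_stats);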
This interactive course demonstrates how the spell correction feature works on a web page and lets you experiment with different examples.

Query cache stores compressed result sets in memory and reuses them for subsequent queries when possible. You can configure it using the following directives:
* qcache_max_bytes - a limit on the cache size in bytes; the default is 16777216 (16 MB). Setting qcache_max_bytes to 0 completely disables the query cache.
* qcache_thresh_msec - the minimum wall query time, in milliseconds, for a result set to be cached; the default is 3000.
* qcache_ttl_sec - the cached entry TTL, in seconds; the default is 60.
These settings can be changed on the fly using the SET GLOBAL statement:
mysql> SET GLOBAL qcache_max_bytes=128000000;
These changes are applied immediately, and cached result sets that no longer satisfy the constraints are immediately discarded. When reducing the cache size on the fly, MRU (most recently used) result sets win.
Query cache operates as follows. When enabled, every full-text search result is completely stored in memory. This occurs after full-text matching, filtering, and ranking, so essentially we store total_found {docid,weight} pairs. Compressed matches can consume anywhere from 2 bytes to 12 bytes per match on average, mostly depending on the deltas between subsequent docids. Once the query is complete, we check the wall time and size thresholds, and either save the compressed result set for reuse or discard it.
Note that the query cache's impact on RAM is not limited by qcache_max_bytes! If you run, for example, 10 concurrent queries, each matching up to 1M matches (after filters), then the peak temporary RAM usage will be in the range of 40 MB to 240 MB, even if the queries are fast enough and don't get cached.
Queries can use the cache when the table, the full-text query (i.e., MATCH() contents), and the ranker all match, and the filters are compatible. This means the full-text part within MATCH() must be a bytewise match: add a single extra space, and it's now a different query as far as the query cache is concerned.
Cache entries expire with TTL and also get invalidated on table rotation, on TRUNCATE, or on ATTACH. Note that currently, entries are not invalidated on arbitrary RT table writes! So a cached query might return older results for the duration of its TTL.
You can inspect the current cache status with SHOW STATUS through the qcache_XXX variables:
mysql> SHOW STATUS LIKE 'qcache%';
+-----------------------+----------+
| Counter | Value |
+-----------------------+----------+
| qcache_max_bytes | 16777216 |
| qcache_thresh_msec | 3000 |
| qcache_ttl_sec | 60 |
| qcache_cached_queries | 0 |
| qcache_used_bytes | 0 |
| qcache_hits | 0 |
+-----------------------+----------+
6 rows in set (0.00 sec)
Collations primarily impact string attribute comparisons. They define both the character set encoding and the strategy Manticore employs for comparing strings when performing ORDER BY or GROUP BY with a string attribute involved.
String attributes are stored as-is during indexing, and no character set or language information is attached to them. This is fine as long as Manticore only needs to store and return the strings to the calling application verbatim. However, when you ask Manticore to sort by a string value, the request immediately becomes ambiguous.
First, single-byte (ASCII, ISO-8859-1, or Windows-1251) strings need to be processed differently than UTF-8 strings, which may encode each character with a variable number of bytes. Thus, we need to know the character set type to properly interpret the raw bytes as meaningful characters.
Second, we also need to know the language-specific string sorting rules. For example, when sorting according to US rules in the en_US locale, the accented character ï (small letter i with diaeresis) should be placed somewhere after z. However, when sorting with French rules and the fr_FR locale in mind, it should be placed between i and j. Some other set of rules might choose to ignore accents altogether, allowing ï and i to be mixed arbitrarily.
Third, in some cases, we may require case-sensitive sorting, while in others, case-insensitive sorting is needed.
Collations encapsulate all of the following: the character set, the language rules, and the case sensitivity. Manticore currently provides four collations:
* libc_ci
* libc_cs
* utf8_general_ci
* binary
The first two collations rely on several standard C library (libc) calls and can thus support any locale installed on your system. They provide case-insensitive (_ci) and case-sensitive (_cs) comparisons, respectively. By default, they use the C locale, effectively resorting to bytewise comparisons. To change that, you need to specify a different available locale using the collation_libc_locale directive. The list of locales available on your system can usually be obtained with the locale command:
$ locale -a
C
en_AG
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IN
en_NG
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZW.utf8
es_ES
fr_FR
POSIX
ru_RU.utf8
ru_UA.utf8
The specific list of system locales may vary. Consult your OS documentation to install additional needed locales.
The utf8_general_ci and binary collations are built into Manticore. The first one is a generic collation for UTF-8 data (without any so-called language tailoring); it should behave similarly to the utf8_general_ci collation in MySQL. The second one is a simple bytewise comparison.
The collation can be overridden via SQL on a per-session basis using the SET collation_connection statement. All subsequent SQL queries will then use this collation. Otherwise, all queries will use the server default collation, which can be set via the collation_server configuration directive. Manticore currently defaults to the libc_ci collation.
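For example, a session could switch to the built-in UTF-8 collation before sorting by a string attribute (the table cities and its attribute city_name are hypothetical):
SET collation_connection = utf8_general_ci;
SELECT * FROM cities ORDER BY city_name ASC;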
Collations affect all string attribute comparisons, including those within ORDER BY and GROUP BY, so differently ordered or grouped results can be returned depending on the collation chosen. Note that collations don't affect full-text searching; for that, use the charset_table.
When Manticore executes a fullscan query, it can either use a plain scan to check every document against the filters or employ additional data and/or algorithms to speed up query execution. Manticore uses a cost-based optimizer (CBO), also known as a "query optimizer" to determine which approach to take.
The CBO can also enhance the performance of full-text queries. See below for more details.
The CBO may decide to replace one or more query filters with one of the following entities if it determines that doing so will improve performance:
* A docid index, stored in files with the .spt extension. Besides improving filters on document IDs, the docid index is also used to accelerate document ID to row ID lookups and to speed up the application of large killlists during daemon startup.
* Secondary indexes, stored in files with the .spidx extension.
The optimizer estimates the cost of each execution path using various attribute statistics, including histograms of the data distribution within an attribute (stored in .sphi files). Histograms are generated automatically when data is indexed and serve as the primary source of information for the CBO.
The optimizer computes the execution cost for every filter used in a query. Since certain filters can be replaced with several different entities (e.g., for a document id, Manticore can use a plain scan, a docid index lookup, a columnar scan (if the document id is columnar), and a secondary index), the optimizer evaluates all available combinations. However, there is a maximum limit of 1024 combinations.
To estimate query execution costs, the optimizer calculates the estimated costs of the most significant operations performed when executing the query. It uses preset constants to represent the cost of each operation.
The optimizer compares the costs of each execution path and chooses the path with the lowest cost to execute the query.
When working with full-text queries that have filters by attributes, the query optimizer decides between two possible execution paths. One is to execute the full-text query, retrieve the matches, and use filters. The other is to replace filters with one or more entities described above, fetch rowids from them, and inject them into the full-text matching tree. This way, full-text search results will intersect with full-scan results. The query optimizer estimates the cost of full-text tree evaluation and the best possible path for computing filter results. Using this information, the optimizer chooses the execution path.
Another factor to consider is multithreaded query execution (when pseudo_sharding is enabled). The CBO is aware that some queries can be executed in multiple threads and takes this into account. The CBO prioritizes shorter query execution times (i.e., latency) over throughput. For instance, if a query using a columnar scan can be executed in multiple threads (and occupy multiple CPU cores) and is faster than a query executed in a single thread using secondary indexes, multithreaded execution will be preferred.
Queries using secondary indexes and docid indexes always run in a single thread, as benchmarks indicate that there is little to no benefit in making them multithreaded.
At present, the optimizer only uses CPU costs and does not take memory or disk usage into account.
Manticore Search supports the ability to add embeddings generated by your Machine Learning models to each document, and then doing a nearest-neighbor search on them. This lets you build features like similarity search, recommendations, semantic search, and relevance ranking based on NLP algorithms, among others, including image, video, and sound searches.
An embedding is a method of representing data—such as text, images, or sound—as vectors in a high-dimensional space. These vectors are crafted to ensure that the distance between them reflects the similarity of the data they represent. This process typically employs algorithms like word embeddings (e.g., Word2Vec, BERT) for text or neural networks for images. The high-dimensional nature of the vector space, with many components per vector, allows for the representation of complex and nuanced relationships between items. Their similarity is gauged by the distance between these vectors, often measured using methods like Euclidean distance or cosine similarity.
Manticore Search enables k-nearest neighbor (KNN) vector searches using the HNSW library. This functionality is part of the Manticore Columnar Library.
To run KNN searches, you must first configure your table. It needs to have at least one float_vector attribute, which serves as a data vector. You need to specify the following properties:
* knn_type: A mandatory setting; currently, only hnsw is supported.
* knn_dims: A mandatory setting that specifies the dimensions of the vectors being indexed.
* hnsw_similarity: A mandatory setting that specifies the distance function used by the HNSW index. Acceptable values are:
  * L2 - Squared L2
  * IP - Inner product
  * COSINE - Cosine similarity
* hnsw_m: An optional setting that defines the maximum number of outgoing connections in the graph. The default is 16.
* hnsw_ef_construction: An optional setting that defines a construction time/accuracy trade-off.
create table test ( title text, image_vector float_vector knn_type='hnsw' knn_dims='4' hnsw_similarity='l2' );
Query OK, 0 rows affected (0.01 sec)
After creating the table, you need to insert your vector data, ensuring it matches the dimensions you specified when creating the table.
insert into test values ( 1, 'yellow bag', (0.653448,0.192478,0.017971,0.339821) ), ( 2, 'white bag', (-0.148894,0.748278,0.091892,-0.095406) );
Query OK, 2 rows affected (0.00 sec)
POST /insert
{
"index":"test_vec",
"id":1,
"doc": { "title" : "yellow bag", "image_vector" : [0.653448,0.192478,0.017971,0.339821] }
}
POST /insert
{
"index":"test_vec",
"id":2,
"doc": { "title" : "white bag", "image_vector" : [-0.148894,0.748278,0.091892,-0.095406] }
}
{
"_index":"test",
"_id":1,
"created":true,
"result":"created",
"status":201
}
{
"_index":"test",
"_id":2,
"created":true,
"result":"created",
"status":201
}
Now, you can perform a KNN search using the knn clause in either SQL or JSON format. Both interfaces support the same essential parameters, ensuring a consistent experience regardless of the format you choose:
select ... from <table name> where knn ( <field>, <k>, <query vector> [,<ef>] )
POST /search
{
"index": "<table name>",
"knn":
{
"field": "<field>",
"query_vector": [<query vector>],
"k": <k>,
"ef": <ef>
}
}
The parameters are:
* field: This is the name of the float vector attribute containing vector data.
* k: This represents the number of documents to return and is a key parameter for Hierarchical Navigable Small World (HNSW) indexes. It specifies the quantity of documents that a single HNSW index should return. However, the actual number of documents included in the final results may vary. For instance, if the system is dealing with real-time tables divided into disk chunks, each chunk could return k documents, leading to a total that exceeds the specified k (as the cumulative count would be num_chunks * k). On the other hand, the final document count might be less than k if, after requesting k documents, some are filtered out based on specific attributes. It's important to note that the parameter k does not apply to ram chunks. In the context of ram chunks, the retrieval process operates differently, and thus, the k parameter's effect on the number of documents returned is not applicable.
* query_vector: This is the search vector.
* ef: Optional size of the dynamic list used during the search. A higher ef leads to a more accurate but slower search.
Documents are always sorted by their distance to the search vector. Any additional sorting criteria you specify will be applied after this primary sort condition. For retrieving the distance, there is a built-in function called knn_dist().
select id, knn_dist() from test where knn ( image_vector, 5, (0.286569,-0.031816,0.066684,0.032926), 2000 );
+------+------------+
| id | knn_dist() |
+------+------------+
| 1 | 0.28146550 |
| 2 | 0.81527930 |
+------+------------+
2 rows in set (0.00 sec)
POST /search
{
"index": "test",
"knn":
{
"field": "image_vector",
"query_vector": [0.286569,-0.031816,0.066684,0.032926],
"k": 5,
"ef": 2000
}
}
{
"took":0,
"timed_out":false,
"hits":
{
"total":2,
"total_relation":"eq",
"hits":
[
{
"_id":"1",
"_score":1,
"_knn_dist":0.28146550,
"_source":
{
"title":"yellow bag",
"image_vector":[0.653448,0.192478,0.017971,0.339821]
}
},
{
"_id":"2",
"_score":1,
"_knn_dist":0.81527930,
"_source":
{
"title":"white bag",
"image_vector":[-0.148894,0.748278,0.091892,-0.095406]
}
}
]
}
}
Finding documents similar to a specific one based on its unique ID is a common task. For instance, when a user views a particular item, Manticore Search can efficiently identify and display a list of items that are most similar to it in the vector space. Here's how you can do it:
select ... from <table name> where knn ( <field>, <k>, <document id> )
POST /search
{
"index": "<table name>",
"knn":
{
"field": "<field>",
"doc_id": <document id>,
"k": <k>
}
}
The parameters are:
* field: This is the name of the float vector attribute containing vector data.
* k: This represents the number of documents to return and is a key parameter for Hierarchical Navigable Small World (HNSW) indexes. It specifies the quantity of documents that a single HNSW index should return. However, the actual number of documents included in the final results may vary. For instance, if the system is dealing with real-time tables divided into disk chunks, each chunk could return k documents, leading to a total that exceeds the specified k (as the cumulative count would be num_chunks * k). On the other hand, the final document count might be less than k if, after requesting k documents, some are filtered out based on specific attributes. It's important to note that the parameter k does not apply to ram chunks. In the context of ram chunks, the retrieval process operates differently, and thus, the k parameter's effect on the number of documents returned is not applicable.
* document id: Document ID for KNN similarity search.
select id, knn_dist() from test where knn ( image_vector, 5, 1 );
+------+------------+
| id | knn_dist() |
+------+------------+
| 2 | 0.81527930 |
+------+------------+
1 row in set (0.00 sec)
POST /search
{
"index": "test",
"knn":
{
"field": "image_vector",
"doc_id": 1,
"k": 5
}
}
{
"took":0,
"timed_out":false,
"hits":
{
"total":1,
"total_relation":"eq",
"hits":
[
{
"_id":"2",
"_score":1643,
"_knn_dist":0.81527930,
"_source":
{
"title":"white bag",
"image_vector":[-0.148894,0.748278,0.091892,-0.095406]
}
}
]
}
}
Manticore also supports additional filtering of documents returned by the KNN search, either by full-text matching, attribute filters, or both.
select id, knn_dist() from test where knn ( image_vector, 5, (0.286569,-0.031816,0.066684,0.032926) ) and match('white') and id < 10;
+------+------------+
| id | knn_dist() |
+------+------------+
| 2 | 0.81527930 |
+------+------------+
1 row in set (0.00 sec)
POST /search
{
"index": "test",
"knn":
{
"field": "image_vector",
"query_vector": [0.286569,-0.031816,0.066684,0.032926],
"k": 5,
"filter":
{
"bool":
{
"must":
[
{ "match": {"_all":"white"} },
{ "range": { "id": { "lt": 10 } } }
]
}
}
}
}
{
"took":0,
"timed_out":false,
"hits":
{
"total":1,
"total_relation":"eq",
"hits":
[
{
"_id":"2",
"_score":1643,
"_knn_dist":0.81527930,
"_source":
{
"title":"white bag",
"image_vector":[-0.148894,0.748278,0.091892,-0.095406]
}
}
]
}
}
ALTER TABLE table ADD COLUMN column_name [{INTEGER|INT|BIGINT|FLOAT|BOOL|MULTI|MULTI64|JSON|STRING|TIMESTAMP|TEXT [INDEXED [ATTRIBUTE]]}] [engine='columnar']
ALTER TABLE table DROP COLUMN column_name
ALTER TABLE table MODIFY COLUMN column_name bigint
This feature only supports adding one field at a time for RT tables or the expansion of an int column to bigint. The supported data types are:
* int - integer attribute
* timestamp - timestamp attribute
* bigint - big integer attribute
* float - float attribute
* bool - boolean attribute
* multi - multi-valued integer attribute
* multi64 - multi-valued bigint attribute
* json - json attribute
* string / text attribute / string attribute - string attribute
* text / text indexed stored / string indexed stored - full-text indexed field with original value stored in docstore
* text indexed / string indexed - full-text indexed field, indexed only (the original value is not stored in docstore)
* text indexed attribute / string indexed attribute - full-text indexed field + string attribute (not storing the original value in docstore)
* text stored / string stored - the value will be only stored in docstore, not full-text indexed, not a string attribute
Adding engine='columnar' to any attribute (except for json) will make it stored in the columnar storage.
Keep in mind:
* It's recommended to back up the table before ALTERing it to avoid data corruption in case of a sudden power interruption or other similar issues.
* ALTER will not work for distributed tables and tables without any attributes.
* You can't drop the id column.
* When a field exists both as a full-text field and as a string attribute, the first ALTER DROP drops the attribute, the second one drops the full-text field.
mysql> desc rt;
+------------+-----------+
| Field | Type |
+------------+-----------+
| id | bigint |
| text | field |
| group_id | uint |
| date_added | timestamp |
+------------+-----------+
mysql> alter table rt add column test integer;
mysql> desc rt;
+------------+-----------+
| Field | Type |
+------------+-----------+
| id | bigint |
| text | field |
| group_id | uint |
| date_added | timestamp |
| test | uint |
+------------+-----------+
mysql> alter table rt drop column group_id;
mysql> desc rt;
+------------+-----------+
| Field | Type |
+------------+-----------+
| id | bigint |
| text | field |
| date_added | timestamp |
| test | uint |
+------------+-----------+
mysql> alter table rt add column title text indexed;
mysql> desc rt;
+------------+-----------+------------+
| Field | Type | Properties |
+------------+-----------+------------+
| id | bigint | |
| text | text | indexed |
| title | text | indexed |
| date_added | timestamp | |
| test | uint | |
+------------+-----------+------------+
mysql> alter table rt add column title text attribute;
mysql> desc rt;
+------------+-----------+------------+
| Field | Type | Properties |
+------------+-----------+------------+
| id | bigint | |
| text | text | indexed |
| title | text | indexed |
| date_added | timestamp | |
| test | uint | |
| title | string | |
+------------+-----------+------------+
mysql> alter table rt drop column title;
mysql> desc rt;
+------------+-----------+------------+
| Field | Type | Properties |
+------------+-----------+------------+
| id | bigint | |
| text | text | indexed |
| title | text | indexed |
| date_added | timestamp | |
| test | uint | |
+------------+-----------+------------+
mysql> alter table rt drop column title;
mysql> desc rt;
+------------+-----------+------------+
| Field | Type | Properties |
+------------+-----------+------------+
| id | bigint | |
| text | text | indexed |
| date_added | timestamp | |
| test | uint | |
+------------+-----------+------------+
ALTER TABLE table ft_setting='value'[, ft_setting2='value']
You can use ALTER to modify the full-text settings of your table in RT mode. However, it only affects new documents and not existing ones.
Example:
* We create a table with a full-text field and a charset_table that allows only 3 searchable characters: a, b and c.
* We insert a document abcd and find it by the query abcd; the d just gets ignored since it's not in the charset_table array.
* Then we realize we want d to be searchable too, so we add it with help of ALTER.
* But the same query where match('abcd') still says it searched by abc, because the existing document remembers previous contents of charset_table.
* Then we add one more document abcd and search by abcd again.
* Now it finds both documents, and show meta says it used two keywords: abc (to find the old document) and abcd (for the new one).
mysql> create table rt(title text) charset_table='a,b,c';
mysql> insert into rt(title) values('abcd');
mysql> select * from rt where match('abcd');
+---------------------+-------+
| id | title |
+---------------------+-------+
| 1514630637682688054 | abcd |
+---------------------+-------+
mysql> show meta;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total | 1 |
| total_found | 1 |
| time | 0.000 |
| keyword[0] | abc |
| docs[0] | 1 |
| hits[0] | 1 |
+---------------+-------+
mysql> alter table rt charset_table='a,b,c,d';
mysql> select * from rt where match('abcd');
+---------------------+-------+
| id | title |
+---------------------+-------+
| 1514630637682688054 | abcd |
+---------------------+-------+
mysql> show meta
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total | 1 |
| total_found | 1 |
| time | 0.000 |
| keyword[0] | abc |
| docs[0] | 1 |
| hits[0] | 1 |
+---------------+-------+
mysql> insert into rt(title) values('abcd');
mysql> select * from rt where match('abcd');
+---------------------+-------+
| id | title |
+---------------------+-------+
| 1514630637682688055 | abcd |
| 1514630637682688054 | abcd |
+---------------------+-------+
mysql> show meta;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total | 2 |
| total_found | 2 |
| time | 0.000 |
| keyword[0] | abc |
| docs[0] | 1 |
| hits[0] | 1 |
| keyword[1] | abcd |
| docs[1] | 1 |
| hits[1] | 1 |
+---------------+-------+
ALTER TABLE table RECONFIGURE
ALTER can also reconfigure an RT table in the plain mode, so that new tokenization, morphology, and other text processing settings from the configuration file take effect for new documents. Note that existing documents will be left intact. Internally, it forcibly saves the current RAM chunk as a new disk chunk and adjusts the table header, so that new documents are tokenized using the updated full-text settings.
mysql> show table rt settings;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| settings | |
+---------------+-------+
1 row in set (0.00 sec)
mysql> alter table rt reconfigure;
Query OK, 0 rows affected (0.00 sec)
mysql> show table rt settings;
+---------------+----------------------+
| Variable_name | Value |
+---------------+----------------------+
| settings | morphology = stem_en |
+---------------+----------------------+
1 row in set (0.00 sec)
ALTER TABLE table REBUILD SECONDARY
You can also use ALTER to rebuild secondary indexes in a given table. Sometimes, a secondary index can be disabled for the entire table or for one or multiple attributes within the table - for example, when an attribute is updated, its secondary index gets disabled.
ALTER TABLE table REBUILD SECONDARY rebuilds secondary indexes from attribute data and enables them again.
Additionally, an old version of secondary indexes may be supported but will lack certain features. REBUILD SECONDARY can be used to update secondary indexes.
ALTER TABLE rt REBUILD SECONDARY;
Query OK, 0 rows affected (0.00 sec)
Returns the absolute value of the argument.
Returns the arctangent function of two arguments, expressed in radians.
BITDOT(mask, w0, w1, ...) returns the sum of products of each bit of a mask multiplied by its weight. bit0*w0 + bit1*w1 + ...
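For example (a sketch; the table items and its integer attribute flags are hypothetical): if flags = 5 (binary 101), bits 0 and 2 are set, so the expression below evaluates to 1*10 + 0*20 + 1*40 = 50.
SELECT id, BITDOT(flags, 10, 20, 40) AS score FROM items;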
Returns the smallest integer value greater than or equal to the argument.
Returns the cosine of the argument.
Returns the CRC32 value of a string argument.
Returns the exponent of the argument (e=2.718... to the power of the argument).
Returns the N-th Fibonacci number, where N is the integer argument. That is, arguments of 0 and up will generate the values 0, 1, 1, 2, 3, 5, 8, 13 and so on. Note that the computations are done using 32-bit integer math and thus numbers 48th and up will be returned modulo 2^32.
Returns the largest integer value less than or equal to the argument.
GREATEST(attr_json.some_array) function takes a JSON array as the argument, and returns the greatest value in that array. Also works for MVA.
Returns the result of an integer division of the first argument by the second argument. Both arguments must be of an integer type.
LEAST(attr_json.some_array) function takes a JSON array as the argument, and returns the least value in that array. Also works for MVA.
Returns the natural logarithm of the argument (with the base of e=2.718...).
Returns the common logarithm of the argument (with the base of 10).
Returns the binary logarithm of the argument (with the base of 2).
Returns the larger of two arguments.
Returns the smaller of two arguments.
Returns the first argument raised to the power of the second argument.
Returns a random float between 0 and 1. It can optionally accept a seed, which can be a constant integer or an integer attribute's name.
If you use a seed, keep in mind that it resets rand()'s starting point separately for each plain table, RT disk, RAM chunk, or pseudo shard. Therefore, queries to a distributed table in any form can return multiple identical random values.
Returns the sine of the argument.
Returns the square root of the argument.
BM25A(k1,b) returns the exact BM25A() value. Requires the expr ranker and enabled index_field_lengths. Parameters k1 and b must be floats.
BM25F(k1, b, {field=weight, ...}) returns the exact BM25F() value and requires index_field_lengths to be enabled. The expr ranker is also necessary. Parameters k1 and b must be floats.
Substitutes non-existent columns with default values. It returns either the value of an attribute specified by 'attr-name', or the 'default-value' if that attribute does not exist. STRING or MVA attributes are not supported. This function is useful when searching through multiple tables with different schemas.
SELECT *, EXIST('gid', 6) as cnd FROM i1, i2 WHERE cnd>5
Returns the sort key value of the worst-ranked element in the current top-N matches if the sort key is a float, and 0 otherwise.
Returns the weight of the worst-ranked element in the current top-N matches.
PACKEDFACTORS() can be used in queries to display all calculated weighting factors during matching or to provide a binary attribute for creating a custom ranking UDF. This function only works if the expression ranker is specified and the query is not a full scan; otherwise, it returns an error. PACKEDFACTORS() can take an optional argument that disables ATC ranking factor calculation: PACKEDFACTORS({no_atc=1}). Calculating ATC significantly slows down query processing, so this option can be useful if you need to see the ranking factors but don't require ATC. PACKEDFACTORS() can also output in JSON format: PACKEDFACTORS({json=1}). The respective outputs in either key-value pair or JSON format are shown below. (Note that the examples below are wrapped for readability; actual returned values would be single-line.)
mysql> SELECT id, PACKEDFACTORS() FROM test1
-> WHERE MATCH('test one') OPTION ranker=expr('1') \G
*************************** 1\. row ***************************
id: 1
packedfactors(): bm25=569, bm25a=0.617197, field_mask=2, doc_word_count=2,
field1=(lcs=1, hit_count=2, word_count=2, tf_idf=0.152356,
min_idf=-0.062982, max_idf=0.215338, sum_idf=0.152356, min_hit_pos=4,
min_best_span_pos=4, exact_hit=0, max_window_hits=1, min_gaps=2,
exact_order=1, lccs=1, wlccs=0.215338, atc=-0.003974),
word0=(tf=1, idf=-0.062982),
word1=(tf=1, idf=0.215338)
1 row in set (0.00 sec)
mysql> SELECT id, PACKEDFACTORS({json=1}) FROM test1
-> WHERE MATCH('test one') OPTION ranker=expr('1') \G
*************************** 1\. row ***************************
id: 1
packedfactors({json=1}):
{
"bm25": 569,
"bm25a": 0.617197,
"field_mask": 2,
"doc_word_count": 2,
"fields": [
{
"lcs": 1,
"hit_count": 2,
"word_count": 2,
"tf_idf": 0.152356,
"min_idf": -0.062982,
"max_idf": 0.215338,
"sum_idf": 0.152356,
"min_hit_pos": 4,
"min_best_span_pos": 4,
"exact_hit": 0,
"max_window_hits": 1,
"min_gaps": 2,
"exact_order": 1,
"lccs": 1,
"wlccs": 0.215338,
"atc": -0.003974
}
],
"words": [
{
"tf": 1,
"idf": -0.062982
},
{
"tf": 1,
"idf": 0.215338
}
]
}
1 row in set (0.01 sec)
This function can be used to implement custom ranking functions in UDFs, as in:
SELECT *, CUSTOM_RANK(PACKEDFACTORS()) AS r
FROM my_index
WHERE match('hello')
ORDER BY r DESC
OPTION ranker=expr('1');
Where CUSTOM_RANK() is a function implemented in a UDF. It should declare a SPH_UDF_FACTORS structure (defined in sphinxudf.h), initialize this structure, unpack the factors into it before usage, and deinitialize it afterwards, as follows:
SPH_UDF_FACTORS factors;
sphinx_factors_init(&factors);
sphinx_factors_unpack((DWORD*)args->arg_values[0], &factors);
// ... can use the contents of factors variable here ...
sphinx_factors_deinit(&factors);
PACKEDFACTORS() data is available at all query stages, not just during the initial matching and ranking pass. This enables another particularly interesting application of PACKEDFACTORS(): re-ranking.
In the example above, we used an expression-based ranker with a dummy expression and sorted the result set by the value computed by our UDF. In other words, we used the UDF to rank all our results. Now, let's assume for the sake of an example that our UDF is extremely expensive to compute, with a throughput of only 10,000 calls per second. If our query matches 1,000,000 documents, we would want to use a much simpler expression to do most of our ranking in order to maintain reasonable performance. Then, we would apply the expensive UDF to only a few top results, say, the top 100 results. In other words, we would build the top 100 results using a simpler ranking function and then re-rank those with a more complex one. This can be done with subselects:
SELECT * FROM (
SELECT *, CUSTOM_RANK(PACKEDFACTORS()) AS r
FROM my_index WHERE match('hello')
OPTION ranker=expr('sum(lcs)*1000+bm25')
ORDER BY WEIGHT() DESC
LIMIT 100
) ORDER BY r DESC LIMIT 10
In this example, the expression-based ranker is called for every matched document to compute WEIGHT(), so it gets called 1,000,000 times. However, the UDF computation can be postponed until the outer sort, and it will only be performed for the top 100 matches by WEIGHT(), according to the inner limit. This means the UDF will only be called 100 times. Finally, the top 10 matches by UDF value are selected and returned to the application.
For reference, in a distributed setup, the PACKEDFACTORS() data is sent from the agents to the master node in binary format. This makes it technically feasible to implement additional re-ranking passes on the master node if needed.
When used in SQL but not called from any UDFs, the result of PACKEDFACTORS() is formatted as plain text, which can be used to manually assess the ranking factors. Note that this feature is not currently supported by the Manticore API.
REMOVE_REPEATS ( result_set, column, offset, limit ) - removes repeated adjacent rows with the same 'column' value.
SELECT REMOVE_REPEATS((SELECT * FROM dist1), gid, 0, 10)
The WEIGHT() function returns the calculated matching score. If no ordering is specified, the result is sorted in descending order by the score provided by WEIGHT(). In this example, we order first by weight and then by an integer attribute.
The search above performs a simple matching, where all words need to be present. However, we can do more (and this is just a simple example):
mysql> SELECT *,WEIGHT() FROM testrt WHERE MATCH('"list of business laptops"/3');
+------+------+-------------------------------------+---------------------------+----------+
| id | gid | title | content | weight() |
+------+------+-------------------------------------+---------------------------+----------+
| 1 | 10 | List of HP business laptops | Elitebook Probook | 2397 |
| 2 | 10 | List of Dell business laptops | Latitude Precision Vostro | 2397 |
| 3 | 20 | List of Dell gaming laptops | Inspirion Alienware | 2375 |
| 5 | 30 | List of ASUS ultrabooks and laptops | Zenbook Vivobook | 2375 |
+------+------+-------------------------------------+---------------------------+----------+
4 rows in set (0.00 sec)
mysql> SHOW META;
+----------------+----------+
| Variable_name | Value |
+----------------+----------+
| total | 4 |
| total_found | 4 |
| total_relation | eq |
| time | 0.000 |
| keyword[0] | list |
| docs[0] | 5 |
| hits[0] | 5 |
| keyword[1] | of |
| docs[1] | 4 |
| hits[1] | 4 |
| keyword[2] | business |
| docs[2] | 2 |
| hits[2] | 2 |
| keyword[3] | laptops |
| docs[3] | 5 |
| hits[3] | 5 |
+----------------+----------+
16 rows in set (0.00 sec)
Here, we search for four words, but a match can occur even if only three of the four words are found. The search will rank documents containing all words higher.
The ZONESPANLIST() function returns pairs of matched zone spans. Each pair contains the matched zone span identifier, a colon, and the order number of the matched zone span. For example, if a document reads <emphasis role="bold"><i>text</i> the <i>text</i></emphasis>, and you query for 'ZONESPAN:(i,b) text', then ZONESPANLIST() will return the string "1:1 1:2 2:1", meaning that the first zone span matched "text" in spans 1 and 2, and the second zone span in span 1 only.
QUERY() returns the current search query. QUERY() is a postlimit expression and is intended to be used with SNIPPET().
Table functions are a mechanism for post-query result set processing. Table functions take an arbitrary result set as input and return a new, processed set as output. The first argument should be the input result set, but a table function can optionally take and handle more arguments. Table functions can completely change the result set, including the schema. Currently, only built-in table functions are supported. Table functions work for both outer SELECT and nested SELECT.
Type casting comprises three principal actions: conversion, reinterpretation, and promotion.
Conversion, performed by functions such as TO_STRING(), requires extra computation to produce a value of the target type. Reinterpretation, performed by functions such as SINT(), doesn't involve extra computations; instead, it merely reinterprets existing data. Promotion changes the type in which an expression is evaluated without converting the stored data.
Some functions can return values of more than one type. The TIMEDIFF() function usually returns a string, but can also return a number. So, BIGINT(TIMEDIFF(1,2)) will execute successfully, compelling TIMEDIFF() to supply an integer value. Conversely, DATE_FORMAT() solely returns strings and can't yield a number, meaning that BIGINT(DATE_FORMAT(...)) will fail.
The BIGINT() function promotes an integer argument to a 64-bit type, leaving floating-point arguments untouched. It's designed to ensure the evaluation of specific expressions (such as a*b) in 64-bit mode, even if all arguments are 32-bit.
The DOUBLE() function promotes its argument to a floating-point type. This is designed to help enforce the evaluation of numeric JSON fields.
The INTEGER() function promotes its argument to a 64-bit signed type. This is designed to enforce the evaluation of numeric JSON fields.
The TO_STRING() function forcefully converts its argument to a string type.
The UINT() function promotes its argument to a 32-bit unsigned integer type.
The UINT64() function promotes its argument to a 64-bit unsigned integer type.
The SINT() function forcefully reinterprets its 32-bit unsigned integer argument as signed and extends it to a 64-bit type (since the 32-bit type is unsigned). For instance, 1-2 ordinarily evaluates to 4294967295, but SINT(1-2) evaluates to -1.
ALL(cond FOR var IN json.array) applies to JSON arrays and returns 1 if the condition is true for all elements in the array and 0 otherwise. cond is a general expression that can also use var as the current value of an array element within itself.
select *, ALL(x>0 AND x<4 FOR x IN j.ar) from tbl
+------+--------------+--------------------------------+
| id | j | all(x>0 and x<4 for x in j.ar) |
+------+--------------+--------------------------------+
| 1 | {"ar":[1,3]} | 1 |
| 2 | {"ar":[3,7]} | 0 |
+------+--------------+--------------------------------+
2 rows in set (0.00 sec)
select *, ALL(x>0 AND x<4 FOR x IN j.ar) cond from tbl where cond=1
+------+--------------+------+
| id | j | cond |
+------+--------------+------+
| 1 | {"ar":[1,3]} | 1 |
+------+--------------+------+
1 row in set (0.00 sec)
ALL(mva) is a special constructor for multi-value attributes. When used with comparison operators (including comparison with IN()), it returns 1 if all values from the MVA attribute are found among the compared values.
select * from tbl where all(m) >= 1
+------+------+
| id | m |
+------+------+
| 1 | 1,3 |
| 2 | 3,7 |
+------+------+
2 rows in set (0.00 sec)
select * from tbl where all(m) in (1, 3, 7, 10)
+------+------+
| id | m |
+------+------+
| 1 | 1,3 |
| 2 | 3,7 |
+------+------+
2 rows in set (0.00 sec)
To compare an MVA attribute with an array, avoid using <mva> NOT ALL(); use ALL(<mva>) NOT IN() instead.
select * from tbl where all(m) not in (2, 4)
+------+------+
| id | m |
+------+------+
| 1 | 1,3 |
| 2 | 3,7 |
+------+------+
2 rows in set (0.00 sec)
ALL(string list) is a special operation for filtering string tags.
If all of the words enumerated as arguments of ALL() are present in the attribute, the filter matches. The optional NOT inverts the logic.
This filter internally uses doc-by-doc matching, so in the case of a full-scan query, it might be slower than expected. It is intended for attributes that are not indexed, such as calculated expressions or tags in PQ tables. If you need such filtering, consider making the string attribute a full-text field instead and using the full-text operator match(), which will invoke a full-text search.
select * from tbl where tags all('bug', 'release')
+------+---------------------------+
| id | tags |
+------+---------------------------+
| 1 | bug priority_high release |
| 2 | bug priority_low release |
+------+---------------------------+
2 rows in set (0.00 sec)
mysql> select * from tbl
+------+---------------------------+
| id | tags |
+------+---------------------------+
| 1 | bug priority_high release |
| 2 | bug priority_low release |
+------+---------------------------+
2 rows in set (0.00 sec)
mysql> select * from tbl where tags not all('bug')
Empty set (0.00 sec)
ANY(cond FOR var IN json.array) applies to JSON arrays and returns 1 if the condition is true for any element in the array and 0 otherwise. cond is a general expression that can also use var as the current value of an array element within itself.
select *, ANY(x>5 AND x<10 FOR x IN j.ar) from tbl
+------+--------------+---------------------------------+
| id | j | any(x>5 and x<10 for x in j.ar) |
+------+--------------+---------------------------------+
| 1 | {"ar":[1,3]} | 0 |
| 2 | {"ar":[3,7]} | 1 |
+------+--------------+---------------------------------+
2 rows in set (0.00 sec)
select *, ANY(x>5 AND x<10 FOR x IN j.ar) cond from tbl where cond=1
+------+--------------+------+
| id | j | cond |
+------+--------------+------+
| 2 | {"ar":[3,7]} | 1 |
+------+--------------+------+
1 row in set (0.00 sec)
ANY(mva) is a special constructor for multi-value attributes. When used with comparison operators (including comparison with IN()), it returns 1 if any of the MVA values is found among the compared values.
When comparing an array using IN(), ANY() is assumed by default if not otherwise specified, but a warning will be issued regarding the missing constructor.
mysql> select * from tbl
+------+------+
| id | m |
+------+------+
| 1 | 1,3 |
| 2 | 3,7 |
+------+------+
2 rows in set (0.01 sec)
mysql> select * from tbl where any(m) > 5
+------+------+
| id | m |
+------+------+
| 2 | 3,7 |
+------+------+
1 row in set (0.00 sec)
select * from tbl where any(m) in (1, 7, 10)
+------+------+
| id | m |
+------+------+
| 1 | 1,3 |
| 2 | 3,7 |
+------+------+
2 rows in set (0.00 sec)
To compare an MVA attribute with an array, avoid using <mva> NOT ANY(); use <mva> NOT IN() instead or ANY(<mva>) NOT IN().
mysql> select * from tbl
+------+------+
| id | m |
+------+------+
| 1 | 1,3 |
| 2 | 3,7 |
+------+------+
2 rows in set (0.00 sec)
mysql> select * from tbl where any(m) not in (1, 3, 5)
+------+------+
| id | m |
+------+------+
| 2 | 3,7 |
+------+------+
1 row in set (0.00 sec)
ANY(string list) is a special operation for filtering string tags.
If any of the words enumerated as arguments of ANY() is present in the attribute, the filter matches. The optional NOT inverts the logic.
This filter internally uses doc-by-doc matching, so in the case of a full-scan query, it might be slower than expected. It is intended for attributes that are not indexed, such as calculated expressions or tags in PQ tables. If you need such filtering, consider making the string attribute a full-text field instead and using the full-text operator match(), which will invoke a full-text search.
select * from tbl where tags any('bug', 'feature')
+------+---------------------------+
| id | tags |
+------+---------------------------+
| 1 | bug priority_high release |
| 2 | bug priority_low release |
+------+---------------------------+
2 rows in set (0.00 sec)
select * from tbl
--------------
+------+---------------------------+
| id | tags |
+------+---------------------------+
| 1 | bug priority_high release |
| 2 | bug priority_low release |
+------+---------------------------+
2 rows in set (0.00 sec)
--------------
select * from tbl where tags not any('feature', 'priority_low')
--------------
+------+---------------------------+
| id | tags |
+------+---------------------------+
| 1 | bug priority_high release |
+------+---------------------------+
1 row in set (0.01 sec)
CONTAINS(polygon, x, y) checks whether the (x,y) point is within the given polygon, and returns 1 if true, or 0 if false. The polygon has to be specified using either the POLY2D() or GEOPOLY2D() function. The former is intended for "small" polygons, meaning less than 500 km (300 miles) a side, and it doesn't take into account the Earth's curvature for speed. For larger distances, you should use GEOPOLY2D(), which tessellates the given polygon into smaller parts, accounting for the Earth's curvature.
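For example, a minimal sketch of point-in-polygon checks with a flat-Earth polygon (the coordinates are arbitrary illustration values):
SELECT CONTAINS(POLY2D(0,0, 10,0, 10,10, 0,10), 5, 5);   /* 1: the point is inside the square */
SELECT CONTAINS(POLY2D(0,0, 10,0, 10,10, 0,10), 20, 20); /* 0: the point is outside */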
The behavior of IF() is slightly different from its MySQL counterpart. It takes 3 arguments, checks whether the 1st argument is equal to 0.0, returns the 2nd argument if it is not zero, or the 3rd one when it is. Note that unlike comparison operators, IF() does not use a threshold! Therefore, it's safe to use comparison results as its 1st argument, but arithmetic operators might produce unexpected results. For instance, the following two calls will produce different results even though they are logically equivalent:
IF ( sqrt(3)*sqrt(3)-3<>0, a, b )
IF ( sqrt(3)*sqrt(3)-3, a, b )
In the first case, the comparison operator <> will return 0.0 (false) due to a threshold, and IF() will always return 'b' as a result. In the second case, the same sqrt(3)*sqrt(3)-3 expression will be compared with zero without a threshold by the IF() function itself. However, its value will be slightly different from zero due to limited floating-point calculation precision. Because of this, the comparison with 0.0 done by IF() will not pass, and the second variant will return 'a' as a result.
HISTOGRAM(expr, {hist_interval=size, hist_offset=value}) takes a bucket size and returns the bucket number for the value. The key function is:
key_of_the_bucket = offset + interval * floor ( ( value - offset ) / interval )
The histogram argument interval must be positive. The histogram argument offset must be positive and less than interval. It is used in aggregation, FACET, and grouping.
Example:
SELECT COUNT(*),
HISTOGRAM(price, {hist_interval=100}) as price_range
FROM facets
GROUP BY price_range ORDER BY price_range ASC;
IN(expr,val1,val2,...) takes 2 or more arguments and returns 1 if the 1st argument (expr) is equal to any of the other arguments (val1..valN), or 0 otherwise. Currently, all the checked values (but not the expression itself) are required to be constant. The constants are pre-sorted, and binary search is used, so IN() even against a large arbitrary list of constants will be very quick. The first argument can also be an MVA attribute. In that case, IN() will return 1 if any of the MVA values are equal to any of the other arguments. IN() also supports IN(expr,@uservar) syntax to check whether the value belongs to the list in the given global user variable. The first argument can be a JSON attribute.
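A minimal illustration, assuming a hypothetical table test with an integer attribute gid:
SELECT *, IN(gid, 10, 20, 30) AS cond FROM test WHERE cond = 1;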
INDEXOF(cond FOR var IN json.array) function iterates through all elements in the array and returns the index of the first element for which 'cond' is true, and -1 if 'cond' is false for every element in the array.
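For example, reusing the j.ar JSON array from the ALL()/ANY() examples above, this sketch returns the position of the first element greater than 2, or -1 if there is none:
SELECT *, INDEXOF(x>2 FOR x IN j.ar) FROM tbl;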
INTERVAL(expr,point1,point2,point3,...) takes 2 or more arguments and returns the index of the argument that is less than the first argument: it returns 0 if expr<point1, 1 if point1<=expr<point2, and so on. It is required that point1<point2<...<pointN for this function to work correctly.
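A minimal sketch, assuming a hypothetical table products with a numeric price attribute; the result is 0 for price<100, 1 for 100<=price<500, 2 for 500<=price<1000, and 3 otherwise:
SELECT id, price, INTERVAL(price, 100, 500, 1000) AS price_tier FROM products;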
LENGTH(attr_mva) function returns the number of elements in an MVA set. It works with both 32-bit and 64-bit MVA attributes. LENGTH(attr_json) returns the length of a field in JSON. The return value depends on the type of field. For example, LENGTH(json_attr.some_int) always returns 1, and LENGTH(json_attr.some_array) returns the number of elements in the array. LENGTH(string_expr) function returns the length of the string resulting from an expression.
TO_STRING() must enclose the expression, regardless of whether the expression returns a non-string or it's simply a string attribute.
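A minimal illustration, assuming a hypothetical table test with an MVA attribute tags_mva, a JSON attribute j, and a string attribute title:
SELECT LENGTH(tags_mva), LENGTH(j.some_array), LENGTH(TO_STRING(title)) FROM test;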
RANGE(expr, {range_from=value,range_to=value}) takes a set of ranges and returns the bucket number for the value.
This expression includes the range_from value and excludes the range_to value for each range. A range can be open - having only the range_from or only the range_to value. It is used in aggregation, FACET, and grouping.
Example:
SELECT COUNT(*),
RANGE(price, {range_to=150},{range_from=150,range_to=300},{range_from=300}) price_range
FROM facets
GROUP BY price_range ORDER BY price_range ASC;
REMAP(condition, expression, (cond1, cond2, ...), (expr1, expr2, ...)) function allows you to make some exceptions to expression values depending on condition values. The condition expression should always result in an integer, while the expression can result in an integer or float.
Example:
SELECT id, size, REMAP(size, 15, (5,6,7,8), (1,1,2,2)) s
FROM products
ORDER BY s ASC;
SELECT REMAP(userid, karmapoints, (1, 67), (999, 0)) FROM users;
SELECT REMAP(id%10, salary, (0), (0.0)) FROM employes;
This will put documents with sizes 5 and 6 first, followed by sizes 7 and 8. In case there's an original value not listed in the array (e.g. size 10), it will default to 15, and in this case, will be placed at the end.
Note that CURTIME(), UTC_TIME(), UTC_TIMESTAMP(), and TIMEDIFF() can be promoted to numeric types using arbitrary conversion functions such as BIGINT(), DOUBLE(), etc.
Returns the current timestamp as an INTEGER.
select NOW();
+------------+
| NOW() |
+------------+
| 1615788407 |
+------------+
Returns the current time in the local timezone in hh:ii:ss format.
select CURTIME();
+-----------+
| CURTIME() |
+-----------+
| 07:06:30 |
+-----------+
Returns the current date in the local timezone in YYYY-MM-DD format.
select curdate();
+------------+
| curdate() |
+------------+
| 2023-08-02 |
+------------+
Returns the current time in UTC timezone in hh:ii:ss format.
select UTC_TIME();
+------------+
| UTC_TIME() |
+------------+
| 06:06:18 |
+------------+
Returns the current time in UTC timezone in YYYY-MM-DD hh:ii:ss format.
select UTC_TIMESTAMP();
+---------------------+
| UTC_TIMESTAMP() |
+---------------------+
| 2021-03-15 06:06:03 |
+---------------------+
Returns the integer second (in 0..59 range) from a timestamp argument, according to the current timezone.
select second(now());
+---------------+
| second(now()) |
+---------------+
| 52 |
+---------------+
Returns the integer minute (in 0..59 range) from a timestamp argument, according to the current timezone.
select minute(now());
+---------------+
| minute(now()) |
+---------------+
| 5 |
+---------------+
Returns the integer hour (in 0..23 range) from a timestamp argument, according to the current timezone.
select hour(now());
+-------------+
| hour(now()) |
+-------------+
| 7 |
+-------------+
Returns the integer day of the month (in 1..31 range) from a timestamp argument, according to the current timezone.
select day(now());
+------------+
| day(now()) |
+------------+
| 15 |
+------------+
Returns the integer month (in 1..12 range) from a timestamp argument, according to the current timezone.
select month(now());
+--------------+
| month(now()) |
+--------------+
| 3 |
+--------------+
Returns the integer quarter of the year (in 1..4 range) from a timestamp argument, according to the current timezone.
select quarter(now());
+----------------+
| quarter(now()) |
+----------------+
| 2 |
+----------------+
Returns the integer year (in 1969..2038 range) from a timestamp argument, according to the current timezone.
select year(now());
+-------------+
| year(now()) |
+-------------+
| 2024 |
+-------------+
Returns the weekday name for a given timestamp argument, according to the current timezone.
select dayname(now());
+----------------+
| dayname(now()) |
+----------------+
| Wednesday |
+----------------+
Returns the name of the month for a given timestamp argument, according to the current timezone.
select monthname(now());
+------------------+
| monthname(now()) |
+------------------+
| August |
+------------------+
Returns the integer weekday index (in 1..7 range) for a given timestamp argument, according to the current timezone.
Note that the week starts on Sunday.
select dayofweek(now());
+------------------+
| dayofweek(now()) |
+------------------+
| 5 |
+------------------+
Returns the integer day of the year (in 1..366 range) for a given timestamp argument, according to the current timezone.
select dayofyear(now());
+------------------+
| dayofyear(now()) |
+------------------+
| 214 |
+------------------+
Returns the integer year and day code of the first day of the current week (in 1969001..2038366 range) for a given timestamp argument, according to the current timezone.
select yearweek(now());
+-----------------+
| yearweek(now()) |
+-----------------+
| 2023211 |
+-----------------+
Returns the integer year and month code (in 196912..203801 range) from a timestamp argument, according to the current timezone.
select yearmonth(now());
+------------------+
| yearmonth(now()) |
+------------------+
| 202103 |
+------------------+
Returns the integer year, month, and date code (ranging from 19691231 to 20380119) based on the current timezone.
select yearmonthday(now());
+---------------------+
| yearmonthday(now()) |
+---------------------+
| 20210315 |
+---------------------+
Calculates the difference between two timestamps in the format hh:ii:ss.
select timediff(1615787586, 1613787583);
+----------------------------------+
| timediff(1615787586, 1613787583) |
+----------------------------------+
| 555:33:23 |
+----------------------------------+
Calculates the number of days between two given timestamps.
select datediff(1615787586, 1613787583);
+----------------------------------+
| datediff(1615787586, 1613787583) |
+----------------------------------+
| 23 |
+----------------------------------+
Formats the date part from a timestamp argument as a string in YYYY-MM-DD format.
select date(now());
+-------------+
| date(now()) |
+-------------+
| 2023-08-02 |
+-------------+
Formats the time part from a timestamp argument as a string in HH:MM:SS format.
select time(now());
+-------------+
| time(now()) |
+-------------+
| 15:21:27 |
+-------------+
Returns a formatted string based on the provided date and format arguments. The format argument uses the same specifiers as the strftime function. For convenience, here are some common format specifiers:
- %Y - Four-digit year
- %m - Two-digit month (01-12)
- %d - Two-digit day of the month (01-31)
- %H - Two-digit hour (00-23)
- %M - Two-digit minute (00-59)
- %S - Two-digit second (00-59)
- %T - Time in 24-hour format (%H:%M:%S)
Note that this is not a complete list of the specifiers. Please consult the documentation for strftime() for your operating system to get the full list.
SELECT DATE_FORMAT(NOW(), 'year %Y and time %T');
+------------------------------------------+
| DATE_FORMAT(NOW(), 'year %Y and time %T') |
+------------------------------------------+
| year 2023 and time 11:54:52 |
+------------------------------------------+
This example formats the current date and time, displaying the four-digit year and the time in 24-hour format.
DATE_HISTOGRAM(expr, {calendar_interval='unit_name'}) takes a bucket size as a unit name and returns the bucket number for the value. Values are rounded to the closest bucket. The key function is:
key_of_the_bucket = interval * floor ( value / interval )
Intervals are specified using the unit name, such as week or as a single unit like 1M. Multiple units such as 2w are not supported.
The valid intervals are:
- minute, 1m
- hour, 1h
- day, 1d
- week, 1w (a week is the interval between the start day of the week, hour, minute, second and the next week but the same day and time of the week)
- month, 1M
- year, 1y (a year is the interval between the start day of the month, time and the next year but the same day of the month, time)
Used in aggregation, FACET, and grouping.
Example:
SELECT COUNT(*),
DATE_HISTOGRAM(tm, {calendar_interval='month'}) AS months
FROM facets
GROUP BY months ORDER BY months ASC;
DATE_RANGE(expr, {range_from='date_math', range_to='date_math'}) takes a set of ranges and returns the bucket number for the value.
The expression includes the range_from value and excludes the range_to value for each range. The range can be open - having only the range_from or only the range_to value.
The difference between this and the RANGE() function is that the range_from and range_to values can be expressed in Date math expressions.
Used in aggregation, FACET, and grouping.
Example:
SELECT COUNT(*),
DATE_RANGE(tm, {range_to='2017||+2M/M'},{range_from='2017||+2M/M',range_to='2017||+5M/M'},{range_from='2017||+5M/M'}) AS points
FROM idx_dates
GROUP BY points ORDER BY points ASC;
Date math lets you work with dates and times directly in your searches. It's especially useful for handling data that changes over time. With date math, you can easily do things like find entries from a certain period, analyze data trends, or manage when information should be removed. It simplifies working with dates by letting you add or subtract time from a given date, round dates to the nearest time unit, and more, all within your search queries.
To use date math, you start with a base date, which can be:
- now for the current date and time,
- or a specific date string ending with ||.
Then, you can modify this date with operations like:
- +1y to add one year,
- -1h to subtract one hour,
- /m to round to the nearest month.
You can use these units in your operations:
- s for seconds,
- m for minutes,
- h (or H) for hours,
- d for days,
- w for weeks,
- M for months,
- y for years.
Here are some examples of how you might use date math:
- now+4h means four hours from now.
- now-2d/d is the time two days ago, rounded to the nearest day.
- 2010-04-20||+2M/d is June 20, 2010, rounded to the nearest day.
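For instance, combining date math with the DATE_RANGE() function shown earlier (reusing the idx_dates table and its tm attribute from that example), documents can be bucketed relative to the current month; a sketch:
SELECT COUNT(*),
DATE_RANGE(tm, {range_to='now/M'},{range_from='now/M'}) AS points
FROM idx_dates
GROUP BY points ORDER BY points ASC;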
GEODIST(lat1, lon1, lat2, lon2, [...]) function calculates the geosphere distance between two points specified by their coordinates. Note that by default, both latitudes and longitudes must be in radians, and the result will be in meters. You can use arbitrary expressions for any of the four coordinates. An optimized path will be chosen when one pair of arguments directly refers to a pair of attributes, and the other one is constant.
GEODIST() also accepts an optional 5th argument, allowing you to easily convert between input and output units and select the specific geodistance formula to use. The complete syntax and a few examples are as follows:
GEODIST(lat1, lon1, lat2, lon2, { option=value, ... })
GEODIST(40.7643929, -73.9997683, 40.7642578, -73.9994565, {in=degrees, out=feet})
GEODIST(51.50, -0.12, 29.98, 31.13, {in=deg, out=mi})
The known options and their values are:
- in = {deg | degrees | rad | radians}, specifies the input units;
- out = {m | meters | km | kilometers | ft | feet | mi | miles}, specifies the output units;
- method = {adaptive | haversine}, specifies the geodistance calculation method.
The default method is "adaptive". It is a well-optimized implementation that is both more precise and much faster at all times than "haversine".
GEOPOLY2D(lat1,lon1,lat2,lon2,lat3,lon3...) creates a polygon to be used with the CONTAINS() function. This function takes into account the Earth's curvature by tessellating the polygon into smaller ones, and should be used for larger areas. For small areas, the POLY2D() function can be used instead. The function expects coordinates to be pairs of latitude/longitude coordinates in degrees; if radians are used, it will give the same result as POLY2D().
POLY2D(x1,y1,x2,y2,x3,y3...) creates a polygon to be used with the CONTAINS() function. This polygon assumes a flat Earth, so it should not be too large; for large areas, the GEOPOLY2D() function, which takes Earth's curvature into consideration, should be used.
Concatenates two or more strings into one. Non-string arguments must be explicitly converted to string using the TO_STRING() function.
CONCAT(TO_STRING(float_attr), ',', TO_STRING(int_attr), ',', title)
LEVENSHTEIN ( pattern, source, {normalize=0, length_delta=0}) returns the Levenshtein distance, i.e., the number of single-character edits (insertions, deletions, or substitutions) required to transform pattern into source.
- pattern, source - constant string, string field name, JSON field name, or any expression that produces a string (like e.g., SUBSTRING_INDEX())
- normalize - option to return the distance as a float number in the range [0.0 - 1.0], where 0.0 is an exact match, and 1.0 is the maximum difference. The default value is 0, meaning not to normalize and provide the result as an integer.
- length_delta - skips Levenshtein distance calculation and returns max(strlen(pattern), strlen(source)) if the option is set and the lengths of the strings differ by more than the length_delta value. The default value is 0, meaning to calculate Levenshtein distance for any input strings. This option can be useful when checking mostly similar strings.
SELECT LEVENSHTEIN('gily', attr1) AS dist, WEIGHT() AS w FROM test WHERE MATCH('test') ORDER BY w DESC, dist ASC;
SELECT LEVENSHTEIN('gily', j.name, {length_delta=6}) AS dist, WEIGHT() AS w FROM test WHERE MATCH('test') ORDER BY w DESC;
SELECT LEVENSHTEIN(title, j.name, {normalize=1}) AS dist, WEIGHT() AS w FROM test WHERE MATCH ('test') ORDER BY w DESC, dist ASC;
The REGEX(attr,expr) function returns 1 if a regular expression matches the attribute's string, and 0 otherwise. It works with both string and JSON attributes.
SELECT REGEX(content, 'box?') FROM test;
SELECT REGEX(j.color, 'red | pink') FROM test;
Expressions should adhere to the RE2 syntax. To perform a case-insensitive search, for instance, you can use:
SELECT REGEX(content, '(?i)box') FROM test;
The SNIPPET() function can be used to highlight search results within a given text. The first two arguments are: the text to be highlighted, and a query. Options can be passed to the function as the third, fourth, and so on arguments. SNIPPET() can obtain the text for highlighting directly from the table. In this case, the first argument should be the field name:
SELECT SNIPPET(body, QUERY()) FROM myIndex WHERE MATCH('my.query')
In this example, the QUERY() expression returns the current full-text query. SNIPPET() can also highlight non-indexed text:
SELECT id, SNIPPET('text to highlight', 'my.query', 'limit=100') FROM myIndex WHERE MATCH('my.query')
Additionally, it can be used to highlight text fetched from other sources using a User-Defined Function (UDF):
SELECT id, SNIPPET(myUdf(id), 'my.query', 'limit=100') FROM myIndex WHERE MATCH('my.query')
In this context, myUdf() is a User-Defined Function (UDF) that retrieves a document by its ID from an external storage source. The SNIPPET() function is considered a "post limit" function, which means that the computation of snippets is delayed until the entire final result set is prepared, and even after the LIMIT clause has been applied. For instance, if a LIMIT 20,10 clause is used, SNIPPET() will be called no more than 10 times.
It is important to note that SNIPPET() does not support field-based limitations. For this functionality, use HIGHLIGHT() instead.
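As a rough sketch of the HIGHLIGHT() alternative (same hypothetical table and query as above; the second form limits highlighting to the body field):
SELECT HIGHLIGHT() FROM myIndex WHERE MATCH('my.query');
SELECT HIGHLIGHT({}, 'body') FROM myIndex WHERE MATCH('my.query');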
SUBSTRING_INDEX(string, delimiter, number) returns a substring of the original string, based on a specified number of delimiter occurrences:
SUBSTRING_INDEX() by default returns a string, but it can also be coerced into other types (such as integer or float) if necessary. Numeric values can be converted using specific functions (such as BIGINT(), DOUBLE(), etc.).
SELECT SUBSTRING_INDEX('www.w3schools.com', '.', 2) FROM test;
SELECT SUBSTRING_INDEX(j.coord, ' ', 1) FROM test;
SELECT SUBSTRING_INDEX('1.2 3.4', ' ', 1); /* '1.2' */
SELECT SUBSTRING_INDEX('1.2 3.4', ' ', -1); /* '3.4' */
SELECT sint ( SUBSTRING_INDEX('1.2 3.4', ' ', 1)); /* 1 */
SELECT sint ( SUBSTRING_INDEX('1.2 3.4', ' ', -1)); /* 3 */
SELECT double ( SUBSTRING_INDEX('1.2 3.4', ' ', 1)); /* 1.200000 */
SELECT double ( SUBSTRING_INDEX('1.2 3.4', ' ', -1)); /* 3.400000 */
UPPER(string) converts its argument to upper case; LOWER(string) converts its argument to lower case.
The result can also be promoted to a numeric type, but only if the string argument is convertible to a number. Numeric values can be promoted with arbitrary conversion functions (BIGINT(), DOUBLE(), etc.).
SELECT upper('www.w3schools.com'); /* WWW.W3SCHOOLS.COM */
SELECT double (upper ('1.2e3')); /* 1200.000000 */
SELECT integer (lower ('12345')); /* 12345 */
Returns the IDs of documents that were inserted or replaced by the last statement in the current session.
The same value can also be obtained via the @@session.last_insert_id variable.
mysql> select @@session.last_insert_id;
+--------------------------+
| @@session.last_insert_id |
+--------------------------+
| 11,32 |
+--------------------------+
1 rows in set
mysql> select LAST_INSERT_ID();
+------------------+
| LAST_INSERT_ID() |
+------------------+
| 25,26,29 |
+------------------+
1 rows in set
Returns the current connection ID.
mysql> select CONNECTION_ID();
+-----------------+
| CONNECTION_ID() |
+-----------------+
| 6 |
+-----------------+
1 row in set (0.00 sec)
Returns KNN vector search distance.
mysql> select id, knn_dist() from test where knn ( image_vector, 5, (0.286569,-0.031816,0.066684,0.032926) ) and match('white') and id < 10;
+------+------------+
| id | knn_dist() |
+------+------------+
| 2 | 0.81527930 |
+------+------------+
1 row in set (0.00 sec)
Backing up your tables on a regular basis is essential for recovery in the event of system crashes, hardware failure, or data corruption/loss. It's also highly recommended to make backups before upgrading to a new Manticore Search version or running ALTER TABLE.
Backing up database systems can be done in two unique ways: logical and physical backups. Each of these methods has its pros and cons, which may vary based on the specific database environment and needs. Here, we'll delve into the distinction between these two types of backups.
Logical backups entail exporting the database schema and data as SQL statements or as data formats specific to the database. This backup form is typically readable by humans and can be employed to restore the database on various systems or database engines.
Pros and cons of logical backups:
- ➕ Portability: Logical backups are generally more portable than physical backups, as they can be used to restore the database on different hardware or operating systems.
- ➕ Flexibility: Logical backups allow you to selectively restore specific tables, indexes, or other database objects.
- ➕ Compatibility: Logical backups can be used to migrate data between different database management systems or versions, provided the target system supports the exported format or SQL statements.
- ➖ Slower Backup and Restore: Logical backups can be slower than physical backups, as they require the database engine to convert the data into SQL statements or another export format.
- ➖ Increased System Load: Creating logical backups can cause higher system load, as the process requires more CPU and memory resources to process and export the data.
Manticore Search supports mysqldump for logical backups.
Physical backups involve copying the raw data files and system files that comprise the database. This type of backup essentially creates a snapshot of the database's physical state at a given point in time.
Pros and cons of physical backups:
- ➕ Speed: Physical backups are usually faster than logical backups, as they involve copying raw data files directly from disk.
- ➕ Consistency: Physical backups ensure a consistent backup of the entire database, as all related files are copied together.
- ➕ Lower System Load: Creating physical backups generally places less load on the system compared to logical backups, as the process does not involve additional data processing.
- ➖ Portability: Physical backups are typically less portable than logical backups, as they may be dependent on the specific hardware, operating system, or database engine configuration.
- ➖ Flexibility: Physical backups do not allow for the selective restoration of specific database objects, as the backup contains the entire database's raw files.
- ➖ Compatibility: Physical backups cannot be used to migrate data between different database management systems or versions, as the raw data files may not be compatible across different platforms or software.
Manticore Search has manticore-backup command line tool for physical backups.
In summary, logical backups provide more flexibility, portability, and compatibility but can be slower and more resource-intensive, while physical backups are faster, more consistent, and less resource-intensive but may be limited in terms of portability and flexibility. The choice between these two backup methods will depend on your specific database environment, hardware, and requirements.
The manticore-backup tool, included in the official Manticore Search packages, automates the process of backing up tables for an instance running in RT mode.
If you followed the official installation instructions, you should already have everything installed and don't need to worry. Otherwise, manticore-backup requires either PHP 8.1.10 with specific modules, or manticore-executor, which is part of the manticore-extra package; make sure one of these is available.
Note that manticore-backup is not available for Windows yet.
First, make sure you're running manticore-backup on the same server where the Manticore instance you are about to back up is running.
Second, we recommend running the tool under the root user so it can transfer ownership of the files you are backing up. Otherwise, a backup will still be made, but without ownership transfer. In either case, you should make sure that manticore-backup has access to the data dir of the Manticore instance.
The only required argument for manticore-backup is --backup-dir, which specifies the destination for the backup. If you don't provide any additional arguments, manticore-backup will:
- locate a Manticore instance running with the default configuration
- create a subdirectory in the --backup-dir directory with a timestamped name
- backup all tables found in the instance
manticore-backup --config=path/to/manticore.conf --backup-dir=backupdir
Copyright (c) 2023-2024, Manticore Software LTD (https://manticoresearch.com)
Manticore config file: /etc/manticoresearch/manticore.conf
Tables to backup: all tables
Target dir: /mnt/backup/
Manticore config
endpoint = 127.0.0.1:9308
Manticore versions:
manticore: 5.0.2
columnar: 1.15.4
secondary: 1.15.4
2022-10-04 17:18:39 [Info] Starting the backup...
2022-10-04 17:18:39 [Info] Backing up config files...
2022-10-04 17:18:39 [Info] config files - OK
2022-10-04 17:18:39 [Info] Backing up tables...
2022-10-04 17:18:39 [Info] pq (percolate) [425B]...
2022-10-04 17:18:39 [Info] OK
2022-10-04 17:18:39 [Info] products (rt) [512B]...
2022-10-04 17:18:39 [Info] OK
2022-10-04 17:18:39 [Info] Running sync
2022-10-04 17:18:42 [Info] OK
2022-10-04 17:18:42 [Info] You can find backup here: /mnt/backup/backup-20221004171839
2022-10-04 17:18:42 [Info] Elapsed time: 2.76s
2022-10-04 17:18:42 [Info] Done
To back up specific tables only, use the --tables flag followed by a comma-separated list of tables, for example --tables=tbl1,tbl2. This will only backup the specified tables and ignore the rest.
manticore-backup --backup-dir=/mnt/backup/ --tables=products
Copyright (c) 2023-2024, Manticore Software LTD (https://manticoresearch.com)
Manticore config file: /etc/manticoresearch/manticore.conf
Tables to backup: products
Target dir: /mnt/backup/
Manticore config
endpoint = 127.0.0.1:9308
Manticore versions:
manticore: 5.0.3
columnar: 1.16.1
secondary: 0.0.0
2022-10-04 17:25:02 [Info] Starting the backup...
2022-10-04 17:25:02 [Info] Backing up config files...
2022-10-04 17:25:02 [Info] config files - OK
2022-10-04 17:25:02 [Info] Backing up tables...
2022-10-04 17:25:02 [Info] products (rt) [512B]...
2022-10-04 17:25:02 [Info] OK
2022-10-04 17:25:02 [Info] Running sync
2022-10-04 17:25:06 [Info] OK
2022-10-04 17:25:06 [Info] You can find backup here: /mnt/backup/backup-20221004172502
2022-10-04 17:25:06 [Info] Elapsed time: 4.82s
2022-10-04 17:25:06 [Info] Done
| Argument | Description |
|---|---|
| --backup-dir=path | This is the path to the backup directory where the backup will be stored. The directory must already exist. This argument is required and has no default value. On each backup run, manticore-backup will create a subdirectory in the provided directory with a timestamp in the name (backup-[datetime]), and will copy all required tables to it. So the --backup-dir is a container for all your backups, and it's safe to run the script multiple times. |
| --restore[=backup] | Restore from --backup-dir. Just --restore lists available backups. --restore=backup will restore from <--backup-dir>/backup. |
| --force | Skip versions check on restore and gracefully restore the backup. |
| --disable-telemetry | Pass this flag if you want to disable sending anonymized metrics to Manticore. You can also use the environment variable TELEMETRY=0. |
| --config=/path/to/manticore.conf | Path to the Manticore configuration. Optional. If not provided, a default configuration for your operating system will be used. Used to determine the host and port for communication with the Manticore daemon. The manticore-backup tool supports dynamic configuration files. You can specify the --config option multiple times if your configuration is spread across multiple files. |
| --tables=tbl1,tbl2, ... | Comma-separated list of tables that you want to back up. To back up all tables, omit this argument. All the provided tables must exist in the Manticore instance you are backing up from, or the backup will fail. |
| --compress | Whether the backed up files should be compressed. Not enabled by default. |
| --unlock | In rare cases when something goes wrong, tables can be left in a locked state. Use this argument to unlock them. |
| --version | Show the current version. |
| --help | Show this help. |
You can also back up your data through SQL by running the simple command BACKUP TO /path/to/backup.
Note, this command is not supported in Windows yet.
BACKUP
[{TABLE | TABLES} a[, b]]
[{OPTION | OPTIONS}
async = {on | off | 1 | 0 | true | false | yes | no}
[, compress = {on | off | 1 | 0 | true | false | yes | no}]
]
TO path_to_backup
For instance, to back up tables a and b to the /backup directory, run the following command:
BACKUP TABLES a, b TO /backup
There are options available to control and adjust the backup process, such as:
- async: makes the backup non-blocking, allowing you to receive a response with the query ID immediately and run other queries while the backup is ongoing. The default value is 0.
- compress: enables file compression using zstd. The default value is 0.
For example, to back up all tables in async mode with compression enabled to the /tmp directory:
BACKUP OPTION async = yes, compress = yes TO /tmp
To ensure consistency of tables during backup, Manticore Search's backup tools use the innovative FREEZE and UNFREEZE commands. Unlike the traditional lock and unlock tables feature of e.g. MySQL, FREEZE stops flushing data to disk while still permitting writing (to some extent) and selecting updated data from the table.
However, if your RAM chunk size grows beyond the rt_mem_limit threshold during lengthy backup operations involving many inserts, data may be flushed to disk, and write operations will be blocked until flushing is complete. Despite this, the tool maintains a balance between table locking, data consistency, and database write availability while the table is frozen.
When you use manticore-backup or the SQL BACKUP command, the FREEZE command is executed once and freezes all tables you are backing up simultaneously. The backup process subsequently backs up each table one by one, releasing the freeze after successfully backing up each table.
If backup fails or gets interrupted, the tool tries to unfreeze all the tables.
To restore a Manticore instance from a backup, use the manticore-backup command with the --backup-dir and --restore arguments. For example: manticore-backup --backup-dir=/path/to/backups --restore. If you don't provide any argument for --restore, it will simply list all the backups in the --backup-dir.
manticore-backup --backup-dir=/mnt/backup/ --restore
Copyright (c) 2023-2024, Manticore Software LTD (https://manticoresearch.com)
Manticore config file:
Backup dir: /tmp/
Available backups: 3
backup-20221006144635 (Oct 06 2022 14:46:35)
backup-20221006145233 (Oct 06 2022 14:52:33)
backup-20221007104044 (Oct 07 2022 10:40:44)
To start a restore job, run manticore-backup with the flag --restore=backup name, where backup name is the name of the backup directory within the --backup-dir. Note that:
1. There can't be any Manticore instance running on the same host and port as the one being restored.
2. The old manticore.json file must not exist.
3. The old configuration file must not exist.
4. The old data directory must exist and be empty.
If all conditions are met, the restore will proceed. The tool will provide hints, so you don't have to memorize them. It's crucial to avoid overwriting existing files, so make sure to remove them prior to the restore if they still exist. Hence all the conditions.
manticore-backup --backup-dir=/mnt/backup/ --restore=backup-20221007104044
Copyright (c) 2023-2024, Manticore Software LTD (https://manticoresearch.com)
Manticore config file:
Backup dir: /tmp/
2022-10-07 11:17:25 [Info] Starting to restore...
Manticore config
endpoint = 127.0.0.1:9308
2022-10-07 11:17:25 [Info] Restoring config files...
2022-10-07 11:17:25 [Info] config files - OK
2022-10-07 11:17:25 [Info] Restoring state files...
2022-10-07 11:17:25 [Info] config files - OK
2022-10-07 11:17:25 [Info] Restoring data files...
2022-10-07 11:17:25 [Info] config files - OK
2022-10-07 11:17:25 [Info] The backup '/tmp/backup-20221007104044' was successfully restored.
2022-10-07 11:17:25 [Info] Elapsed time: 0.02s
2022-10-07 11:17:25 [Info] Done
To create a backup of your Manticore Search database, you can use the mysqldump command. We will use the default port and host in the examples.
Note, mysqldump is supported only for real-time tables.
mysqldump -h0 -P9306 manticore > manticore_backup.sql
mariadb-dump -h0 -P9306 manticore > manticore_backup.sql
If you're looking to restore a Manticore Search database from a backup file, the mysql client is your tool of choice.
Note, if you are restoring in Plain mode, you cannot drop and recreate tables directly. Therefore, you should:
- Use mysqldump with the -t option to exclude CREATE TABLE statements from your backup.
- Manually TRUNCATE the tables before proceeding with the restoration.
mysql -h0 -P9306 < manticore_backup.sql
mariadb -h0 -P9306 < manticore_backup.sql
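For the Plain mode case described above, the flow might look like this (a sketch; the table name products and the dump file name are hypothetical):
mysqldump -h0 -P9306 -t manticore > manticore_data.sql
mysql -h0 -P9306 -e "TRUNCATE TABLE products;"
mysql -h0 -P9306 < manticore_data.sql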
Here are some more settings that can be used with mysqldump to tailor your backup:
- --add-drop-table: This injects a DROP TABLE command before each CREATE TABLE command in the backup file.
- --no-data: This setting omits table data from the backup, leading to a backup file that consists of only table schemas.
- --ignore-table=[database_name].[table_name]: This option allows you to bypass a particular table during the backup operation. Note, the database name must be Manticore.
For a comprehensive list of settings and their thorough descriptions, kindly refer to the official MySQL documentation.
We recommend specifying the manticore database explicitly when you plan to back up all databases, rather than using the --all-databases option.
Keep in mind that mysqldump does not support backing up distributed tables. Additionally, it cannot back up tables that contain non-stored fields (consider using manticore-backup or the BACKUP SQL command).
A plain table can be created from an external source using a special tool called indexer, which reads a "recipe" from the configuration, connects to the data sources, pulls documents, and builds table files. This is a lengthy process. If your data changes, the table becomes outdated, and you need to rebuild it from the refreshed sources. If your data changes incrementally, such as a blog or newsfeed where old documents never change and only new ones are added, the rebuild will take more and more time, as you will need to process the archive sources again and again with each pass.
One way to deal with this problem is by using several tables instead of one solid table. For example, you can process sources produced in previous years and save the table. Then, take only sources from the current year and put them into a separate table, rebuilding it as often as necessary. You can then place both tables as parts of a distributed table and use it for querying. The point here is that each time you rebuild, you only process data from the last 12 months at most, and the table with older data remains untouched without needing to be rebuilt. You can go further and divide the last 12 months table into monthly, weekly, or daily tables, and so on.
This approach works, but you need to maintain your distributed table manually: add new chunks, delete old ones, and keep the overall number of partial tables from growing too large (with too many tables, searching can become slower, and the OS usually limits the number of simultaneously opened files). To help with this, you can manually merge several tables together by running indexer --merge. However, that only addresses the problem of having too many tables; maintenance itself remains manual. And even with 'per-hour' reindexing, you will most likely have a noticeable time gap between new data arriving in the sources and rebuilding the table that makes this data searchable.
A real-time table is designed to solve this problem. It consists of two parts: a RAM-based part (the RAM chunk), which holds recently added data, and a collection of plain tables on disk (disk chunks) built earlier. This is very similar to a standard distributed table, made from several local tables.
You don't need to build such a table by running indexer, which reads a "recipe" from the config and indexes data sources. Instead, the real-time table provides the ability to 'insert' new documents and 'replace' existing ones. When executing the 'insert' command, you push new documents to the server. It then builds a small table from the added documents and immediately brings it online. So, right after the 'insert' command completes, you can perform searches in all table parts, including the just-added documents.
The search server automatically maintains the table, so you don't have to worry about it. However, you might be interested in learning a few details about 'how it is maintained'.
First, since indexed data is stored in RAM - what about emergency power-off? Will I lose my table then? Well, before completion, the server saves new data into a special 'binlog'. This consists of one or several files, living on your persistent storage, which incrementally grows as you add more and more changes. You can adjust the behavior regarding how often new queries (or transactions) are stored in the binlog, and how often the 'sync' command is executed over the binlog file to force the OS to actually save the data on a safe storage. The most paranoid approach is to flush and sync after every transaction. This is the slowest but also the safest method. The least expensive way is to switch off the binlog entirely. This is the fastest method, but you risk losing your indexed data. Intermediate variants, like flush/sync every second, are also provided.
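As a rough illustration, the flush/sync policy described above is controlled by the binlog_flush directive in the searchd section of the configuration file; the sketch below picks the safest mode (flush and sync after every transaction). Check the binlog settings documentation for the other modes and their exact semantics in your version:
searchd {
    ...
    binlog_flush = 1 # flush and sync the binlog after every transaction (safest, slowest)
}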
The binlog is designed specifically for sequential saving of newly arriving transactions; it is not a table and cannot be searched over. It is merely an insurance policy to ensure that the server will not lose your data. If a sudden disruption occurs and everything crashes due to a software or hardware problem, the server will load the freshest available dump of the RAM chunk and then replay the binlog, repeating stored transactions. Ultimately, it will achieve the same state as it was in at the moment of the last change.
Second, what about limits? What if I want to process, say, 10TB of data, but it just doesn't fit into RAM! RAM for a real-time table is limited and can be configured. When a certain amount of data is indexed, the server manages the RAM part of the table by merging together small transactions, keeping their number and overall size small. This process can sometimes cause delays during insertion, however. When merging no longer helps, and new insertions hit the RAM limit, the server converts the RAM-based table into a plain table stored on disk (called a disk chunk). This table is added to the collection of tables in the second part of the RT table and becomes accessible online. The RAM is then flushed, and the space is deallocated.
When the data from RAM is securely saved to disk (for instance, when the RAM chunk is converted into a disk chunk, or when it is flushed during a clean shutdown), the binlog for that table is no longer necessary, so it gets discarded. Once all the tables are saved, the binlog file will be deleted.
Third, what about disk collection? If having many disk parts makes searching slower, what's the difference if I make them manually in the distributed table manner, or they're produced as disk parts (or, 'chunks') by an RT table? Well, in both cases, you can merge several tables into one. For example, you can merge hourly tables from yesterday and keep one 'daily' table for yesterday instead. With manual maintenance, you have to think about the schema and commands yourself. With an RT table, the server provides the OPTIMIZE command, which does the same, but keeps you away from unnecessary internal details.
Fourth, if my "document" constitutes a 'mini-table' and I don't need it anymore, I can just throw it away. But if it is 'optimized', i.e. mixed together with tons of other documents, how can I undo or delete it? Yes, indexed documents are 'mixed' together, and there is no easy way to delete one without rebuilding the whole table. And if for plain tables rebuilding or merging is just a normal way of maintenance, for a real-time table it keeps only the simplicity of manipulation, but not 'real-timeness'. To address the problem, Manticore uses a trick: when you delete a document, identified by document ID, the server just tracks the number. Together with other deleted documents, their IDs are saved in a so-called kill-list. When you search over the table, the server first retrieves all matching documents, and then throws out the documents that are found in the kill-list (that is the most basic description; in fact, internally it's more complex). The point is - for the sake of 'immediate' deletion, documents are not actually deleted, but are just marked as 'deleted'. They still occupy space in different table structures, being essentially garbage. Word statistics, which affect ranking, also aren't affected, meaning it works exactly as it is declared: we search among all documents, and then just hide ones marked as deleted from the final result. When a document is replaced, it means that it is killed in the old parts of the table and is inserted again in the freshest part. All consequences of 'hiding by killlist' are also in play in this case.
When a rebuild of some part of a table happens, e.g., when some transactions (segments) of a RAM chunk are merged, or when a RAM chunk is converted into a disk chunk, or when two disk chunks are merged together, the server performs a comprehensive iteration over the affected parts and physically excludes deleted documents from all of them. That is, if they were in document lists of some words - they are wiped away. If it was a unique word - it gets removed completely.
As a summary: the deletion works in two phases:
1. First, we mark documents as 'deleted' in real-time and suppress them in search results.
2. During some operation with an RT table chunk, we finally physically wipe the deleted documents for good.
Fifth, if an RT table contains plain disk tables in its collection, can I just add my ready old disk table to it? No; this is intentionally disallowed to avoid unneeded complexity and prevent accidental corruption. However, if your RT table has just been created and contains no data, then you can ATTACH TABLE your disk table to it. Your old table will be moved inside the RT table and will become its part.
As a summary about the RT table structure: it is a cleverly organized collection of plain disk tables with a fast in-memory table, intended for real-time insertions and semi-real-time deletions of documents. The RT table has a common schema, common settings, and can be easily maintained without deep digging into details.
FLUSH RAMCHUNK rt_table
The FLUSH RAMCHUNK command creates a new disk chunk in an RT table.
Normally, an RT table would automatically flush and convert the contents of the RAM chunk into a new disk chunk once the RAM chunk reaches the maximum allowed rt_mem_limit size. However, for debugging and testing purposes, it might be useful to forcibly create a new disk chunk, and the FLUSH RAMCHUNK statement does exactly that.
FLUSH RAMCHUNK rt;
Query OK, 0 rows affected (0.05 sec)
FLUSH TABLE rt_table
FLUSH TABLE forcefully flushes RT table RAM chunk contents to disk.
The real-time table RAM chunk is automatically flushed to disk during a clean shutdown, or periodically every rt_flush_period seconds.
Issuing a FLUSH TABLE command not only forces the RAM chunk contents to be written to disk but also triggers the cleanup of binary log files.
FLUSH TABLE rt;
Query OK, 0 rows affected (0.05 sec)
Over time, RT tables may become fragmented into numerous disk chunks and/or contaminated with deleted, yet unpurged data, affecting search performance. In these cases, optimization is necessary. Essentially, the optimization process combines pairs of disk chunks, removing documents that were previously deleted using DELETE statements.
Beginning with Manticore 4, this process occurs automatically by default. However, you can also use the following commands to manually initiate table compaction.
OPTIMIZE TABLE index_name [OPTION opt_name = opt_value [,...]]
OPTIMIZE statement adds an RT table to the optimization queue, which will be processed in a background thread.
OPTIMIZE TABLE rt;
By default, OPTIMIZE merges the RT table's disk chunks down to a number equal to # of CPU cores * 2. You can control the number of optimized disk chunks using the cutoff option.
Additional options can be passed via the OPTION clause. For example, to override the default cutoff:
OPTIMIZE TABLE rt OPTION cutoff=4;
When using OPTION sync=1 (0 by default), the command will wait for the optimization process to complete before returning. If the connection is interrupted, the optimization will continue running on the server.
OPTIMIZE TABLE rt OPTION sync=1;
Optimization can be a lengthy and I/O-intensive process. To minimize the impact, all actual merge work is executed serially in a special background thread, and the OPTIMIZE statement simply adds a job to its queue. The optimization thread can be I/O-throttled, and you can control the maximum number of I/Os per second and the maximum I/O size with the rt_merge_iops and rt_merge_maxiosize directives, respectively.
During optimization, the RT table being optimized remains online and available for both searching and updates nearly all the time. It is locked for a very brief period when a pair of disk chunks is successfully merged, allowing for the renaming of old and new files and updating the table header.
As long as auto_optimize is not disabled, tables are optimized automatically.
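For reference, a minimal configuration sketch for disabling it via the auto_optimize directive in the searchd section:
searchd {
...
auto_optimize = 0 # disable automatic table compaction
...
}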
If you are experiencing unexpected SSTs or want tables across all nodes of the cluster to be binary identical, you need to:
1. Disable auto_optimize.
2. Manually optimize tables:
On one of the nodes, drop the table from the cluster:
ALTER CLUSTER mycluster DROP myindex;
Optimize the table:
OPTIMIZE TABLE myindex;
Add back the table to the cluster:
ALTER CLUSTER mycluster ADD myindex;
When the table is added back, the new files created by the optimization process will be replicated to the other nodes in the cluster.
Any local changes made to the table on other nodes will be lost.
Table data modifications (inserts, replaces, deletes, updates) should either be postponed or directed to the node on which the table is being optimized.
Note that while the table is out of the cluster, insert/replace/delete/update commands should refer to it without the cluster name prefix (for SQL statements or the cluster property in case of an HTTP JSON request), otherwise they will fail.
Once the table is added back to the cluster, write operations on the table must include the cluster name prefix again, or they will fail.
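For illustration, using the cluster and table names from the steps above (the column names are hypothetical):
/* while myindex is out of the cluster: refer to it without the cluster prefix */
INSERT INTO myindex (id, title) VALUES (101, 'hello');
/* once myindex is back in the cluster: use the cluster name prefix again */
INSERT INTO mycluster:myindex (id, title) VALUES (102, 'world');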
Search operations are available as usual during the process on any of the nodes.
Manticore provides isolation during the flushing and merging process of a real-time table to prevent any changes from affecting running queries.
For example, during table compaction, a pair of disk chunks are merged and a new chunk is produced. At one point, a new version of the table is created with the new chunk replacing the original pair. This is done seamlessly so that a long-running query using the original chunks will continue to see the old version of the table, while a new query will see the new version with the resulting merged chunk.
The same applies to flushing a RAM chunk, where suitable RAM segments are merged into a new disk chunk and the participated RAM chunk segments are abandoned. During this operation, Manticore provides isolation for queries that started before the operation began.
Furthermore, these operations are transparent for replaces and updates. If you update an attribute in a document that belongs to a disk chunk being merged with another one, the update will be applied to both that chunk and the resulting merged chunk. If you delete a document during a merge, it will be deleted in the original chunk and also in the resulting merged chunk, which will either have the document marked as deleted or have no such document at all if the deletion happened early in the merging process.
FREEZE tbl1[, tbl2, ...]
FREEZE readies a real-time/plain table for a secure backup. Specifically, it:
1. Deactivates table compaction. If the table is currently being compacted, FREEZE will interrupt it.
2. Transfers the current RAM chunk to a disk chunk.
3. Flushes attributes.
4. Disables implicit operations that could modify the disk files.
5. Shows the actual file list associated with the table.
The built-in tool manticore-backup uses FREEZE to ensure data consistency. You can do the same if you want to create your own backup solution or need to freeze tables for other reasons. Just follow these steps:
1. FREEZE a table (or a few).
2. Capture the output of the FREEZE command and back up the specified files.
3. UNFREEZE the table(s) once finished.
FREEZE t;
+-------------------+---------------------------------+
| file | normalized |
+-------------------+---------------------------------+
| data/t/t.0.spa | /work/anytest/data/t/t.0.spa |
| data/t/t.0.spd | /work/anytest/data/t/t.0.spd |
| data/t/t.0.spds | /work/anytest/data/t/t.0.spds |
| data/t/t.0.spe | /work/anytest/data/t/t.0.spe |
| data/t/t.0.sph | /work/anytest/data/t/t.0.sph |
| data/t/t.0.sphi | /work/anytest/data/t/t.0.sphi |
| data/t/t.0.spi | /work/anytest/data/t/t.0.spi |
| data/t/t.0.spm | /work/anytest/data/t/t.0.spm |
| data/t/t.0.spp | /work/anytest/data/t/t.0.spp |
| data/t/t.0.spt | /work/anytest/data/t/t.0.spt |
| data/t/t.meta | /work/anytest/data/t/t.meta |
| data/t/t.ram | /work/anytest/data/t/t.ram |
| data/t/t.settings | /work/anytest/data/t/t.settings |
+-------------------+---------------------------------+
13 rows in set (0.01 sec)
The file column indicates the table's file paths within the data_dir of the running instance. The normalized column displays the absolute paths for the same files. To back up a table, simply copy the provided files without additional preparation.
When a table is frozen, you cannot execute UPDATE queries; they will fail with the error message "index is locked now, try again later."
Also, DELETE and REPLACE queries have some restrictions while the table is frozen:
- If DELETE affects a document in the current RAM chunk, it is permitted.
- If DELETE impacts a document in a disk chunk that was previously deleted, it is allowed.
- If DELETE would alter an actual disk chunk, it will wait until the table is unfrozen.
Manually FLUSHing the RAM chunk of a frozen table will report 'success', but no actual saving will occur.
DROP/TRUNCATE of a frozen table is allowed since these operations are not implicit. We assume that if you truncate or drop a table, you don't need it backed up; therefore, it should not have been frozen initially.
INSERTing into a frozen table is supported but limited: new data will be stored in RAM (as usual) until rt_mem_limit is reached; then, new insertions will wait until the table is unfrozen.
If you shut down the daemon with a frozen table, it will act as if it experienced a dirty shutdown (e.g., kill -9): newly inserted data will not be saved in the RAM-chunk on disk, and upon restart, it will be restored from a binary log (if any) or lost (if binary logging is disabled).
UNFREEZE tbl1[, tbl2, ...]
UNFREEZE reactivates previously blocked operations and resumes the internal compaction service. All operations waiting for a table to unfreeze will also be unfrozen and complete normally.
UNFREEZE tbl;
FLUSH ATTRIBUTES
The FLUSH ATTRIBUTES command flushes all in-memory attribute updates in all the active disk tables to disk. It returns a tag that identifies the result on-disk state (which is basically the number of actual disk attribute saves performed since the server startup).
mysql> UPDATE testindex SET channel_id=1107025 WHERE id=1;
Query OK, 1 row affected (0.04 sec)
mysql> FLUSH ATTRIBUTES;
+------+
| tag |
+------+
| 1 |
+------+
1 row in set (0.19 sec)
FLUSH HOSTNAMES
The FLUSH HOSTNAMES command is used to renew the IP addresses associated with agent host names. If you want DNS to be queried each time a host name needs to be resolved to an IP, you can use the hostname_lookup directive.
mysql> FLUSH HOSTNAMES;
Query OK, 5 rows affected (0.01 sec)
In many cases, you might want to encrypt traffic between your client and the server. To do that, you can specify that the server should use the HTTPS protocol rather than HTTP.
To enable HTTPS, at least the following two directives should be set in the searchd section of the config, and there should be at least one listener set to https:
ssl_cert = server-cert.pem
ssl_key = server-key.pem
In addition to that, you can specify the certificate authority's certificate (aka root certificate) in:
ssl_ca = ca-cert.pem
ssl_cert = server-cert.pem
ssl_key = server-key.pem
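Putting it together with an https listener, a minimal configuration sketch might look like this (the address and port are only an example):
searchd {
...
listen = 0.0.0.0:9443:https
ssl_cert = server-cert.pem
ssl_key = server-key.pem
ssl_ca = ca-cert.pem
...
}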
These steps will help you generate the SSL certificates using the 'openssl' tool.
The server can use a Certificate Authority to verify the signature of certificates, but it can also work with just a private key and certificate (without the CA certificate).
To generate the CA private key, run:
openssl genrsa 2048 > ca-key.pem
To generate a self-signed CA (root) certificate from the private key (make sure to fill in at least the "Common Name"), use the following command:
openssl req -new -x509 -nodes -days 365 -key ca-key.pem -out ca-cert.pem
The server uses the server certificate to secure communication with the client. To generate the certificate request and server private key (ensure that you fill in at least the "Common Name" and that it is different from the root certificate's common name), execute the following commands:
openssl req -newkey rsa:2048 -days 365 -nodes -keyout server-key.pem -out server-req.pem
openssl rsa -in server-key.pem -out server-key.pem
openssl x509 -req -in server-req.pem -days 365 -CA ca-cert.pem -CAkey ca-key.pem -set_serial 01 -out server-cert.pem
Once completed, you can verify that the key and certificate files were generated correctly by running:
openssl verify -CAfile ca-cert.pem server-cert.pem
When your SSL configuration is valid, the following features are available:
- HTTPS connections on the https port.
- SSL-secured connections to the mysql port via MySQL clients.
The mysql client tries to use SSL by default, so a typical connection to Manticore with a valid SSL configuration will most likely be secured. You can check this by running the SQL 'status' command after connecting.
If your SSL configuration is not valid for any reason (which the daemon detects by the fact that a secured connection cannot be established; apart from an invalid configuration, there may be other reasons, such as the inability to load the appropriate SSL library at all), the following things will not work or will work in a non-secured manner:
- https port: HTTPS connections will be dropped.
- mysql port via a MySQL client: SSL will not be supported. If the client requires SSL, the connection will fail; if SSL is not required, a plain MySQL or compressed connection will be used.
Read-only mode for a connection disables any table or global modifications. Therefore, queries like create, drop, various types of alter, attach, optimize, and data modification queries such as insert, replace, delete, update, and others will all be rejected. Changing daemon-wide settings using SET GLOBAL is also not possible in this mode.
However, you can still perform all search operations, generate snippets, and run CALL PQ queries. Additionally, you can modify local (connection-wide) settings.
To check if your current connection is read-only or not, execute the show variables like 'session_read_only' statement. A value of 1 indicates read-only, while 0 means not read-only (usual).
Typically, you define a separate listen directive in read-only mode by adding the suffix _readonly to it. However, you can also do this interactively for the current connection by executing the SET ro=1 statement via SQL.
If you're connected to a VIP socket, you can execute SET ro=0 (even if the socket you are connected to was defined as read-only in the config and not interactively). This will switch the connection to the usual (not read-only) mode with all modifications allowed.
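For example, to switch the current connection to read-only mode and verify it:
SET ro=1;
SHOW VARIABLES LIKE 'session_read_only';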
For standard (non-VIP) connections, escaping read-only mode is only possible by reconnecting if it was set interactively, or by updating the configuration file and restarting the daemon.
Query logging can be enabled by setting the query_log directive in the searchd section of the configuration file.
Queries can also be sent to syslog by setting syslog instead of a file path. In this case, all search queries will be sent to the syslog daemon with LOG_INFO priority, prefixed with [query] instead of a timestamp. Only the plain log format is supported for syslog.
query_log example:
searchd {
...
query_log = /var/log/query.log
query_log_format = sphinxql # default
...
}
Two query log formats are supported:
- sphinxql (default): Logs in SQL format. It also provides an easy way to replay logged queries.
- plain: Logs full-text queries in a simple text format. Recommended if most of your queries are primarily full-text, or if you don't care about non-full-text components of your queries, such as filtering by attributes, sorting, grouping, etc. Queries logged in the plain format cannot be replayed.
To switch between the formats, you can use the searchd setting query_log_format.
The SQL log format is the default setting. In this mode, Manticore logs all successful and unsuccessful select queries. Requests sent as SQL or via the binary API are logged in the SQL format, but JSON queries are logged as is. This type of logging only works with plain log files and does not support the 'syslog' service for logging.
query_log_format example:
query_log_format = sphinxql # default
Compared to the plain format, the Manticore SQL log format records the complete query in directly replayable SQL form, along with a comment containing the timestamp, connection id and client address, the real and wall times, and the number of matches found, as shown in the example below.
sphinxql log entries example:
/* Sun Apr 28 12:38:02.808 2024 conn 2 (127.0.0.1:53228) real 0.000 wall 0.000 found 0 */ SELECT * FROM test WHERE MATCH('test') OPTION ranker=proximity;
/* Sun Apr 28 12:38:05.585 2024 conn 2 (127.0.0.1:53228) real 0.001 wall 0.001 found 0 */ SELECT * FROM test WHERE MATCH('test') GROUP BY channel_id OPTION ranker=proximity;
/* Sun Apr 28 12:40:57.366 2024 conn 4 (127.0.0.1:53256) real 0.000 wall 0.000 found 0 */ /*{
"index" : "test",
"query":
{
"match":
{
"*" : "test"
}
},
"_source": ["f"],
"limit": 30
} */
With the plain log format, Manticore logs all successfully executed search queries in a simple text format. Non-full-text parts of queries are not logged. JSON queries are logged as flattened to a single line.
query_log_format example:
query_log_format = plain
The log format is as follows:
[query-date] real-time wall-time [match-mode/filters-count/sort-mode total-matches (offset,limit) @groupby-attr] [table-name] {perf-stats} query
where:
- real-time is the time from the start to the finish of the query.
- wall-time is similar to real-time, but excludes the time spent waiting for agents and merging result sets from them.
- perf-stats includes CPU/IO stats when Manticore is started with --cpustats (or it was enabled via SET GLOBAL cpustats=1) and/or --iostats (or it was enabled via SET GLOBAL iostats=1):
  - ios is the number of file I/O operations carried out;
  - kb is the amount of data in kilobytes read from the table files;
  - ms is the time spent on I/O operations;
  - cpums is the time in milliseconds spent on CPU processing the query.
match-mode can have one of the following values:
- SPH_MATCH_ALL mode;
- SPH_MATCH_ANY mode;
- SPH_MATCH_PHRASE mode;
- SPH_MATCH_BOOLEAN mode;
- SPH_MATCH_EXTENDED mode;
- SPH_MATCH_EXTENDED2 mode;
- "scan" if the full scan mode was used, either by being specified with SPH_MATCH_FULLSCAN or if the query was empty.
sort-mode can have one of the following values:
- SPH_SORT_RELEVANCE mode;
- SPH_SORT_ATTR_DESC mode;
- SPH_SORT_ATTR_ASC mode;
- SPH_SORT_TIME_SEGMENTS mode;
- SPH_SORT_EXTENDED mode.
Note: the SPH_* modes are specific to the legacy Sphinx interface. The SQL and JSON interfaces will, in most cases, log ext2 as match-mode and ext or rel as sort-mode.
Query log example:
[Fri Jun 29 21:17:58 2021] 0.004 sec [all/0/rel 35254 (0,20)] [lj] [ios=6 kb=111.1 ms=0.5] test
[Fri Jun 29 21:17:58 2021] 0.004 sec [all/0/rel 35254 (0,20)] [lj] [ios=6 kb=111.1 ms=0.5 cpums=0.3] test
[Sun Apr 28 15:09:38.712 2024] 0.000 sec 0.000 sec [ext2/0/ext 0 (0,20)] [test] test
[Sun Apr 28 15:09:44.974 2024] 0.000 sec 0.000 sec [ext2/0/ext 0 (0,20) @channel_id] [test] test
[Sun Apr 28 15:24:32.975 2024] 0.000 sec 0.000 sec [ext2/0/ext 0 (0,30)] [test] { "index" : "test", "query": { "match": { "*" : "test" } }, "_source": ["f"], "limit": 30 }
By default, all queries are logged. If you want to log only queries with execution times exceeding a specified limit, the query_log_min_msec directive can be used.
The expected unit of measurement is milliseconds, but time suffix expressions can also be used.
query_log_min_msec example:
searchd {
...
query_log = /var/log/query.log
query_log_min_msec = 1000
# query_log_min_msec = 1s
...
}
By default, the searchd and query log files are created with permission 600, so only the user under which Manticore is running and root can read the log files. The query_log_mode option allows setting a different permission. This can be helpful for allowing other users to read the log files (for example, monitoring solutions running on non-root users).
query_log_mode example:
searchd {
...
query_log = /var/log/query.log
query_log_mode = 666
...
}
By default, Manticore search daemon logs all runtime events in a searchd.log file in the directory where searchd was started. In Linux by default, you can find the log at /var/log/manticore/searchd.log.
The log file path/name can be overridden by setting log in the searchd section of the configuration file.
searchd {
...
log = /custom/path/to/searchd.log
...
}
You can also specify:
- syslog as the file name. In this case, events will be sent to your server's syslog daemon.
- /dev/stdout as the file name. In this case, on Linux, Manticore will simply output the events. This can be useful in Docker/Kubernetes environments.
Binary logging serves as a recovery mechanism for Real-Time table data, as well as attribute updates for plain tables that would otherwise only be stored in RAM until a flush occurs. When binary logs are enabled, searchd records each transaction to the binlog file and utilizes it for recovery following an unclean shutdown. During a clean shutdown, RAM chunks are saved to disk, and all binlog files are subsequently unlinked.
By default, binary logging is enabled. On Linux systems, the default location for binlog.* files is /var/lib/manticore/data/.
In RT mode, binary logs are stored in the data_dir folder, unless specified otherwise.
To disable binary logging, set binlog_path to empty:
searchd {
...
binlog_path = # disable logging
...
}
Disabling binary logging can lead to better performance for Real-Time tables, but it also puts their data at risk.
You can use the following directive to set a custom path:
searchd {
...
binlog_path = /var/data
...
}
When logging is enabled, each transaction committed to an RT table is written to a log file. After an unclean shutdown, logs are automatically replayed upon startup, recovering any logged changes.
During normal operation, a new binlog file is opened whenever the binlog_max_log_size limit is reached. Older, closed binlog files are retained until all transactions stored in them (from all tables) are flushed as a disk chunk. Setting the limit to 0 essentially prevents the binlog from being unlinked while searchd is running; however, it will still be unlinked upon a clean shutdown. By default, there is no limit to the log file size.
binlog_max_log_size = 16M
There are 3 different binlog flushing strategies, controlled by the binlog_flush directive:
- 0: the log is flushed and synced to disk once per second. This gives the best performance, but up to a second's worth of committed transactions can be lost in case of a crash.
- 1: the log is flushed and synced after every transaction. This is the safest mode, but also the slowest.
- 2: the log is flushed after every transaction, but synced only once per second, so committed transactions survive a daemon crash, while up to a second's worth can be lost if the server itself fails.
The default mode is to flush every transaction and sync every second (mode 2).
searchd {
...
binlog_flush = 1 # ultimate safety, low speed
...
}
During recovery after an unclean shutdown, binlogs are replayed, and all logged transactions since the last good on-disk state are restored. Transactions are checksummed, so in case of binlog file corruption, garbage data will not be replayed; such a broken transaction will be detected and will stop the replay.
Intensive updates to a small RT table that fully fits into a RAM chunk can result in an ever-growing binlog that can never be unlinked until a clean shutdown. Binlogs essentially serve as append-only deltas against the last known good saved state on disk, and they cannot be unlinked unless the RAM chunk is saved. An ever-growing binlog is not ideal for disk usage and crash recovery time. To address this issue, you can configure searchd to perform periodic RAM chunk flushes using the rt_flush_period directive. With periodic flushes enabled, searchd will maintain a separate thread that checks whether RT table RAM chunks need to be written back to disk. Once this occurs, the respective binlogs can be (and are) safely unlinked.
searchd {
...
rt_flush_period = 3600 # 1 hour
...
}
The default RT flush period is set to 10 hours.
It's important to note that rt_flush_period only controls the frequency at which checks occur. There are no guarantees that a specific RAM chunk will be saved. For example, it doesn't make sense to regularly re-save a large RAM chunk that only receives a few rows worth of updates. Manticore automatically determines whether to perform the flush using a few heuristics.
When you use the official Manticore docker image, the server log is sent to /dev/stdout which can be viewed from host with:
docker logs manticore
The query log can be diverted to the Docker log by passing the variable QUERY_LOG_TO_STDOUT=true.
The log folder is the same as in the case of the Linux package, set to /var/log/manticore. If desired, it can be mounted to a local path to view or process the logs.
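For example, a minimal sketch of running the official image with query logging diverted to the Docker log (the container name and port mapping are illustrative):
docker run -d --name manticore -e QUERY_LOG_TO_STDOUT=true -p 9306:9306 manticoresearch/manticore
docker logs manticore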
Manticore Search accepts the USR1 signal for reopening server and query log files.
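For example, assuming the default pid_file location shown later in this document:
kill -USR1 $(cat /var/run/manticore/searchd.pid)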
The official DEB and RPM packages install a Logrotate configuration file for all files in the default log folder.
A simple logrotate configuration for log files looks like:
/var/log/manticore/*.log {
weekly
rotate 10
copytruncate
delaycompress
compress
notifempty
missingok
}
Additionally, the FLUSH LOGS SQL command is available, which works the same way as the USR1 system signal. It initiates the reopening of the searchd log and query log files, allowing you to implement log file rotation. The command is non-blocking (i.e., it returns immediately).
mysql> FLUSH LOGS;
Query OK, 0 rows affected (0.01 sec)
The easiest way to view high-level information about your Manticore node is by running status in the MySQL client. It will display information about various aspects, such as the server version, uptime, the number of threads, the length of the work queue, and the number of connections (clients).
mysql> status
--------------
mysql Ver 14.14 Distrib 5.7.30, for Linux (x86_64) using EditLine wrapper
Connection id: 378
Current database: Manticore
Current user: Usual
SSL: Not in use
Current pager: stdout
Using outfile: ''
Using delimiter: ;
Server version: 3.4.3 a48c61d6@200702 coroutines git branch coroutines_work_junk...origin/coroutines_work_junk
Protocol version: 10
Connection: 0 via TCP/IP
Server characterset:
Db characterset:
Client characterset: utf8
Conn. characterset: utf8
TCP port: 8306
Uptime: 23 hours 6 sec
Threads: 12 Queue: 3 Clients: 1 Vip clients: 0 Tasks: 5 Queries: 318967 Wall: 7h CPU: 0us
Queue/Th: 0.2 Tasks/Th: 0.4
--------------
SHOW STATUS [ LIKE pattern ]
SHOW STATUS is an SQL statement that presents various helpful performance counters. IO and CPU counters will only be available if searchd was started with the --iostats and --cpustats switches, respectively (or if they were enabled via SET GLOBAL iostats/cpustats=1).
SHOW STATUS;
+-----------------------+---------------------------+
| Counter | Value |
+-----------------------+---------------------------+
| uptime | 1385 |
| connections | 11 |
| maxed_out | 0 |
| version | 3.4.3 ab7cbe5d@200511 dev |
| mysql_version | 3.4.3 ab7cbe5d@200511 dev |
| command_search | 2 |
| command_excerpt | 0 |
| command_update | 0 |
| command_delete | 0 |
| command_keywords | 0 |
| command_persist | 0 |
| command_status | 1 |
| command_flushattrs | 0 |
| command_set | 1 |
| command_insert | 0 |
| command_replace | 0 |
| command_commit | 0 |
| command_suggest | 0 |
| command_json | 0 |
| command_callpq | 0 |
| agent_connect | 0 |
| agent_retry | 0 |
| queries | 12 |
| dist_queries | 0 |
| workers_total | 30 |
| workers_active | 1 |
| workers_clients | 0 |
| workers_clients_vip | 1 |
| work_queue_length | 1 |
| query_wall | 10.805 |
| query_cpu | OFF |
| dist_wall | 0.000 |
| dist_local | 0.000 |
| dist_wait | 0.000 |
| query_reads | OFF |
| query_readkb | OFF |
| query_readtime | OFF |
| avg_query_wall | 0.900 |
| avg_query_cpu | OFF |
| avg_dist_wall | 0.000 |
| avg_dist_local | 0.000 |
| avg_dist_wait | 0.000 |
| avg_query_reads | OFF |
| avg_query_readkb | OFF |
| avg_query_readtime | OFF |
| qcache_max_bytes | 0 |
| qcache_thresh_msec | 3000 |
| qcache_ttl_sec | 60 |
| qcache_cached_queries | 0 |
| qcache_used_bytes | 0 |
| qcache_hits | 0 |
+-----------------------+---------------------------+
49 rows in set (0.00 sec)
An optional LIKE clause is supported, allowing you to select only the variables that match a specific pattern. The pattern syntax follows standard SQL wildcards, where % represents any number of any characters, and _ represents a single character.
SHOW STATUS LIKE 'qcache%';
+-----------------------+-------+
| Counter | Value |
+-----------------------+-------+
| qcache_max_bytes | 0 |
| qcache_thresh_msec | 3000 |
| qcache_ttl_sec | 60 |
| qcache_cached_queries | 0 |
| qcache_used_bytes | 0 |
| qcache_hits | 0 |
+-----------------------+-------+
6 rows in set (0.00 sec)
SHOW SETTINGS is an SQL statement that displays the current settings from your configuration file. The setting names are represented in the following format: 'config_section_name'.'setting_name'
The result also includes two additional values:
- configuration_file - The path to the configuration file
- worker_pid - The process ID of the running searchd instance
SHOW SETTINGS;
+--------------------------+-------------------------------------+
| Setting_name | Value |
+--------------------------+-------------------------------------+
| configuration_file | /etc/manticoresearch/manticore.conf |
| worker_pid | 658 |
| searchd.listen           | 0.0.0.0:9312                        |
| searchd.listen           | 0.0.0.0:9306:mysql                  |
| searchd.listen           | 0.0.0.0:9308:http                   |
| searchd.log | /var/log/manticore/searchd.log |
| searchd.query_log | /var/log/manticore/query.log |
| searchd.pid_file | /var/run/manticore/searchd.pid |
| searchd.data_dir | /var/lib/manticore |
| searchd.query_log_format | sphinxql |
| searchd.binlog_path | /var/lib/manticore/binlog |
| common.plugin_dir | /usr/local/lib/manticore |
| common.lemmatizer_base | /usr/share/manticore/morph/ |
+--------------------------+-------------------------------------+
13 rows in set (0.00 sec)
SHOW AGENT ['agent_or_index'] STATUS [ LIKE pattern ]
SHOW AGENT STATUS displays the statistics of remote agents or a distributed table. It includes values such as the age of the last request, last answer, the number of various types of errors and successes, and so on. Statistics are displayed for every agent for the last 1, 5, and 15 intervals, each consisting of ha_period_karma seconds.
SHOW AGENT STATUS;
+------------------------------------+----------------------------+
| Variable_name | Value |
+------------------------------------+----------------------------+
| status_period_seconds | 60 |
| status_stored_periods | 15 |
| ag_0_hostname | 192.168.0.202:6713 |
| ag_0_references | 2 |
| ag_0_lastquery | 0.41 |
| ag_0_lastanswer | 0.19 |
| ag_0_lastperiodmsec | 222 |
| ag_0_pingtripmsec | 10.521 |
| ag_0_errorsarow | 0 |
| ag_0_1periods_query_timeouts | 0 |
| ag_0_1periods_connect_timeouts | 0 |
| ag_0_1periods_connect_failures | 0 |
| ag_0_1periods_network_errors | 0 |
| ag_0_1periods_wrong_replies | 0 |
| ag_0_1periods_unexpected_closings | 0 |
| ag_0_1periods_warnings | 0 |
| ag_0_1periods_succeeded_queries | 27 |
| ag_0_1periods_msecsperquery | 232.31 |
| ag_0_5periods_query_timeouts | 0 |
| ag_0_5periods_connect_timeouts | 0 |
| ag_0_5periods_connect_failures | 0 |
| ag_0_5periods_network_errors | 0 |
| ag_0_5periods_wrong_replies | 0 |
| ag_0_5periods_unexpected_closings | 0 |
| ag_0_5periods_warnings | 0 |
| ag_0_5periods_succeeded_queries | 146 |
| ag_0_5periods_msecsperquery | 231.83 |
| ag_1_hostname | 192.168.0.202:6714 |
| ag_1_references | 2 |
| ag_1_lastquery | 0.41 |
| ag_1_lastanswer | 0.19 |
| ag_1_lastperiodmsec | 220 |
| ag_1_pingtripmsec | 10.004 |
| ag_1_errorsarow | 0 |
| ag_1_1periods_query_timeouts | 0 |
| ag_1_1periods_connect_timeouts | 0 |
| ag_1_1periods_connect_failures | 0 |
| ag_1_1periods_network_errors | 0 |
| ag_1_1periods_wrong_replies | 0 |
| ag_1_1periods_unexpected_closings | 0 |
| ag_1_1periods_warnings | 0 |
| ag_1_1periods_succeeded_queries | 27 |
| ag_1_1periods_msecsperquery | 231.24 |
| ag_1_5periods_query_timeouts | 0 |
| ag_1_5periods_connect_timeouts | 0 |
| ag_1_5periods_connect_failures | 0 |
| ag_1_5periods_network_errors | 0 |
| ag_1_5periods_wrong_replies | 0 |
| ag_1_5periods_unexpected_closings | 0 |
| ag_1_5periods_warnings | 0 |
| ag_1_5periods_succeeded_queries | 146 |
| ag_1_5periods_msecsperquery | 230.85 |
+------------------------------------+----------------------------+
50 rows in set (0.01 sec)
$client->nodes()->agentstatus();
Array(
[status_period_seconds] => 60
[status_stored_periods] => 15
[ag_0_hostname] => 192.168.0.202:6713
[ag_0_references] => 2
[ag_0_lastquery] => 0.41
[ag_0_lastanswer] => 0.19
[ag_0_lastperiodmsec] => 222
[ag_0_errorsarow] => 0
[ag_0_1periods_query_timeouts] => 0
[ag_0_1periods_connect_timeouts] => 0
[ag_0_1periods_connect_failures] => 0
[ag_0_1periods_network_errors] => 0
[ag_0_1periods_wrong_replies] => 0
[ag_0_1periods_unexpected_closings] => 0
[ag_0_1periods_warnings] => 0
[ag_0_1periods_succeeded_queries] => 27
[ag_0_1periods_msecsperquery] => 232.31
[ag_0_5periods_query_timeouts] => 0
[ag_0_5periods_connect_timeouts] => 0
[ag_0_5periods_connect_failures] => 0
[ag_0_5periods_network_errors] => 0
[ag_0_5periods_wrong_replies] => 0
[ag_0_5periods_unexpected_closings] => 0
[ag_0_5periods_warnings] => 0
[ag_0_5periods_succeeded_queries] => 146
[ag_0_5periods_msecsperquery] => 231.83
[ag_1_hostname] => 192.168.0.202:6714
[ag_1_references] => 2
[ag_1_lastquery] => 0.41
[ag_1_lastanswer] => 0.19
[ag_1_lastperiodmsec] => 220
[ag_1_errorsarow] => 0
[ag_1_1periods_query_timeouts] => 0
[ag_1_1periods_connect_timeouts] => 0
[ag_1_1periods_connect_failures] => 0
[ag_1_1periods_network_errors] => 0
[ag_1_1periods_wrong_replies] => 0
[ag_1_1periods_unexpected_closings] => 0
[ag_1_1periods_warnings] => 0
[ag_1_1periods_succeeded_queries] => 27
[ag_1_1periods_msecsperquery] => 231.24
[ag_1_5periods_query_timeouts] => 0
[ag_1_5periods_connect_timeouts] => 0
[ag_1_5periods_connect_failures] => 0
[ag_1_5periods_network_errors] => 0
[ag_1_5periods_wrong_replies] => 0
[ag_1_5periods_unexpected_closings] => 0
[ag_1_5periods_warnings] => 0
[ag_1_5periods_succeeded_queries] => 146
[ag_1_5periods_msecsperquery] => 230.85
)
utilsApi.sql('SHOW AGENT STATUS')
{u'columns': [{u'Key': {u'type': u'string'}},
{u'Value': {u'type': u'string'}}],
u'data': [
{u'Key': u'status_period_seconds', u'Value': u'60'},
{u'Key': u'status_stored_periods', u'Value': u'15'},
{u'Key': u'ag_0_hostname', u'Value': u'192.168.0.202:6713'},
{u'Key': u'ag_0_references', u'Value': u'2'},
{u'Key': u'ag_0_lastquery', u'Value': u'0.41'},
{u'Key': u'ag_0_lastanswer', u'Value': u'0.19'},
{u'Key': u'ag_0_lastperiodmsec', u'Value': u'222'},
{u'Key': u'ag_0_errorsarow', u'Value': u'0'},
{u'Key': u'ag_0_1periods_query_timeouts', u'Value': u'0'},
{u'Key': u'ag_0_1periods_connect_timeouts', u'Value': u'0'},
{u'Key': u'ag_0_1periods_connect_failures', u'Value': u'0'},
{u'Key': u'ag_0_1periods_network_errors', u'Value': u'0'},
{u'Key': u'ag_0_1periods_wrong_replies', u'Value': u'0'},
{u'Key': u'ag_0_1periods_unexpected_closings', u'Value': u'0'},
{u'Key': u'ag_0_1periods_warnings', u'Value': u'0'},
{u'Key': u'ag_0_1periods_succeeded_queries', u'Value': u'27'},
{u'Key': u'ag_0_1periods_msecsperquery', u'Value': u'232.31'},
{u'Key': u'ag_0_5periods_query_timeouts', u'Value': u'0'},
{u'Key': u'ag_0_5periods_connect_timeouts', u'Value': u'0'},
{u'Key': u'ag_0_5periods_connect_failures', u'Value': u'0'},
{u'Key': u'ag_0_5periods_network_errors', u'Value': u'0'},
{u'Key': u'ag_0_5periods_wrong_replies', u'Value': u'0'},
{u'Key': u'ag_0_5periods_unexpected_closings', u'Value': u'0'},
{u'Key': u'ag_0_5periods_warnings', u'Value': u'0'},
{u'Key': u'ag_0_5periods_succeeded_queries', u'Value': u'146'},
{u'Key': u'ag_0_5periods_msecsperquery', u'Value': u'231.83'},
{u'Key': u'ag_1_hostname', u'Value': u'192.168.0.202:6714'},
{u'Key': u'ag_1_references', u'Value': u'2'},
{u'Key': u'ag_1_lastquery', u'Value': u'0.41'},
{u'Key': u'ag_1_lastanswer', u'Value': u'0.19'},
{u'Key': u'ag_1_lastperiodmsec', u'Value': u'220'},
{u'Key': u'ag_1_errorsarow', u'Value': u'0'},
{u'Key': u'ag_1_1periods_query_timeouts', u'Value': u'0'},
{u'Key': u'ag_1_1periods_connect_timeouts', u'Value': u'0'},
{u'Key': u'ag_1_1periods_connect_failures', u'Value': u'0'},
{u'Key': u'ag_1_1periods_network_errors', u'Value': u'0'},
{u'Key': u'ag_1_1periods_wrong_replies', u'Value': u'0'},
{u'Key': u'ag_1_1periods_unexpected_closings', u'Value': u'0'},
{u'Key': u'ag_1_1periods_warnings', u'Value': u'0'},
{u'Key': u'ag_1_1periods_succeeded_queries', u'Value': u'27'},
{u'Key': u'ag_1_1periods_msecsperquery', u'Value': u'231.24'},
{u'Key': u'ag_1_5periods_query_timeouts', u'Value': u'0'},
{u'Key': u'ag_1_5periods_connect_timeouts', u'Value': u'0'},
{u'Key': u'ag_1_5periods_connect_failures', u'Value': u'0'},
{u'Key': u'ag_1_5periods_network_errors', u'Value': u'0'},
{u'Key': u'ag_1_5periods_wrong_replies', u'Value': u'0'},
{u'Key': u'ag_1_5periods_warnings', u'Value': u'0'},
{u'Key': u'ag_1_5periods_succeeded_queries', u'Value': u'146'},
{u'Key': u'ag_1_5periods_msecsperquery', u'Value': u'230.85'}],
u'error': u'',
u'total': 0,
u'warning': u''}
res = await utilsApi.sql("SHOW AGENT STATUS");
{"columns": [{"Key": {"type": "string"}},
{"Value": {"type": "string"}}],
"data": [
{"Key": "status_period_seconds", "Value": "60"},
{"Key": "status_stored_periods", "Value": "15"},
{"Key": "ag_0_hostname", "Value": "192.168.0.202:6713"},
{"Key": "ag_0_references", "Value": "2"},
{"Key": "ag_0_lastquery", "Value": "0.41"},
{"Key": "ag_0_lastanswer", "Value": "0.19"},
{"Key": "ag_0_lastperiodmsec", "Value": "222"},
{"Key": "ag_0_errorsarow", "Value": "0"},
{"Key": "ag_0_1periods_query_timeouts", "Value": "0"},
{"Key": "ag_0_1periods_connect_timeouts", "Value": "0"},
{"Key": "ag_0_1periods_connect_failures", "Value": "0"},
{"Key": "ag_0_1periods_network_errors", "Value": "0"},
{"Key": "ag_0_1periods_wrong_replies", "Value": "0"},
{"Key": "ag_0_1periods_unexpected_closings", "Value": "0"},
{"Key": "ag_0_1periods_warnings", "Value": "0"},
{"Key": "ag_0_1periods_succeeded_queries", "Value": "27"},
{"Key": "ag_0_1periods_msecsperquery", "Value": "232.31"},
{"Key": "ag_0_5periods_query_timeouts", "Value": "0"},
{"Key": "ag_0_5periods_connect_timeouts", "Value": "0"},
{"Key": "ag_0_5periods_connect_failures", "Value": "0"},
{"Key": "ag_0_5periods_network_errors", "Value": "0"},
{"Key": "ag_0_5periods_wrong_replies", "Value": "0"},
{"Key": "ag_0_5periods_unexpected_closings", "Value": "0"},
{"Key": "ag_0_5periods_warnings", "Value": "0"},
{"Key": "ag_0_5periods_succeeded_queries", "Value": "146"},
{"Key": "ag_0_5periods_msecsperquery", "Value": "231.83"},
{"Key": "ag_1_hostname 192.168.0.202:6714"},
{"Key": "ag_1_references", "Value": "2"},
{"Key": "ag_1_lastquery", "Value": "0.41"},
{"Key": "ag_1_lastanswer", "Value": "0.19"},
{"Key": "ag_1_lastperiodmsec", "Value": "220"},
{"Key": "ag_1_errorsarow", "Value": "0"},
{"Key": "ag_1_1periods_query_timeouts", "Value": "0"},
{"Key": "ag_1_1periods_connect_timeouts", "Value": "0"},
{"Key": "ag_1_1periods_connect_failures", "Value": "0"},
{"Key": "ag_1_1periods_network_errors", "Value": "0"},
{"Key": "ag_1_1periods_wrong_replies", "Value": "0"},
{"Key": "ag_1_1periods_unexpected_closings", "Value": "0"},
{"Key": "ag_1_1periods_warnings", "Value": "0"},
{"Key": "ag_1_1periods_succeeded_queries", "Value": "27"},
{"Key": "ag_1_1periods_msecsperquery", "Value": "231.24"},
{"Key": "ag_1_5periods_query_timeouts", "Value": "0"},
{"Key": "ag_1_5periods_connect_timeouts", "Value": "0"},
{"Key": "ag_1_5periods_connect_failures", "Value": "0"},
{"Key": "ag_1_5periods_network_errors", "Value": "0"},
{"Key": "ag_1_5periods_wrong_replies", "Value": "0"},
{"Key": "ag_1_5periods_warnings", "Value": "0"},
{"Key": "ag_1_5periods_succeeded_queries", "Value": "146"},
{"Key": "ag_1_5periods_msecsperquery", "Value": "230.85"}],
"error": "",
"total": 0,
"warning": ""}
utilsApi.sql("SHOW AGENT STATUS");
{columns=[{ Key : { type=string }},
{ Value : { type=string }}],
data : [
{ Key=status_period_seconds , Value=60 },
{ Key=status_stored_periods , Value=15 },
{ Key=ag_0_hostname , Value=192.168.0.202:6713 },
{ Key=ag_0_references , Value=2 },
{ Key=ag_0_lastquery , Value=0.41 },
{ Key=ag_0_lastanswer , Value=0.19 },
{ Key=ag_0_lastperiodmsec , Value=222 },
{ Key=ag_0_errorsarow , Value=0 },
{ Key=ag_0_1periods_query_timeouts , Value=0 },
{ Key=ag_0_1periods_connect_timeouts , Value=0 },
{ Key=ag_0_1periods_connect_failures , Value=0 },
{ Key=ag_0_1periods_network_errors , Value=0 },
{ Key=ag_0_1periods_wrong_replies , Value=0 },
{ Key=ag_0_1periods_unexpected_closings , Value=0 },
{ Key=ag_0_1periods_warnings , Value=0 },
{ Key=ag_0_1periods_succeeded_queries , Value=27 },
{ Key=ag_0_1periods_msecsperquery , Value=232.31 },
{ Key=ag_0_5periods_query_timeouts , Value=0 },
{ Key=ag_0_5periods_connect_timeouts , Value=0 },
{ Key=ag_0_5periods_connect_failures , Value=0 },
{ Key=ag_0_5periods_network_errors , Value=0 },
{ Key=ag_0_5periods_wrong_replies , Value=0 },
{ Key=ag_0_5periods_unexpected_closings , Value=0 },
{ Key=ag_0_5periods_warnings , Value=0 },
{ Key=ag_0_5periods_succeeded_queries , Value=146 },
{ Key=ag_0_5periods_msecsperquery , Value=231.83 },
{ Key=ag_1_hostname , Value=192.168.0.202:6714 },
{ Key=ag_1_references , Value=2 },
{ Key=ag_1_lastquery , Value=0.41 },
{ Key=ag_1_lastanswer , Value=0.19 },
{ Key=ag_1_lastperiodmsec , Value=220 },
{ Key=ag_1_errorsarow , Value=0 },
{ Key=ag_1_1periods_query_timeouts , Value=0 },
{ Key=ag_1_1periods_connect_timeouts , Value=0 },
{ Key=ag_1_1periods_connect_failures , Value=0 },
{ Key=ag_1_1periods_network_errors , Value=0 },
{ Key=ag_1_1periods_wrong_replies , Value=0 },
{ Key=ag_1_1periods_unexpected_closings , Value=0 },
{ Key=ag_1_1periods_warnings , Value=0 },
{ Key=ag_1_1periods_succeeded_queries , Value=27 },
{ Key=ag_1_1periods_msecsperquery , Value=231.24 },
{ Key=ag_1_5periods_query_timeouts , Value=0 },
{ Key=ag_1_5periods_connect_timeouts , Value=0 },
{ Key=ag_1_5periods_connect_failures , Value=0 },
{ Key=ag_1_5periods_network_errors , Value=0 },
{ Key=ag_1_5periods_wrong_replies , Value=0 },
{ Key=ag_1_5periods_warnings , Value=0 },
{ Key=ag_1_5periods_succeeded_queries , Value=146 },
{ Key=ag_1_5periods_msecsperquery , Value=230.85 }],
error= ,
total=0,
warning= }
utilsApi.Sql("SHOW AGENT STATUS");
{columns=[{ Key : { type=string }},
{ Value : { type=string }}],
data : [
{ Key=status_period_seconds , Value=60 },
{ Key=status_stored_periods , Value=15 },
{ Key=ag_0_hostname , Value=192.168.0.202:6713 },
{ Key=ag_0_references , Value=2 },
{ Key=ag_0_lastquery , Value=0.41 },
{ Key=ag_0_lastanswer , Value=0.19 },
{ Key=ag_0_lastperiodmsec , Value=222 },
{ Key=ag_0_errorsarow , Value=0 },
{ Key=ag_0_1periods_query_timeouts , Value=0 },
{ Key=ag_0_1periods_connect_timeouts , Value=0 },
{ Key=ag_0_1periods_connect_failures , Value=0 },
{ Key=ag_0_1periods_network_errors , Value=0 },
{ Key=ag_0_1periods_wrong_replies , Value=0 },
{ Key=ag_0_1periods_unexpected_closings , Value=0 },
{ Key=ag_0_1periods_warnings , Value=0 },
{ Key=ag_0_1periods_succeeded_queries , Value=27 },
{ Key=ag_0_1periods_msecsperquery , Value=232.31 },
{ Key=ag_0_5periods_query_timeouts , Value=0 },
{ Key=ag_0_5periods_connect_timeouts , Value=0 },
{ Key=ag_0_5periods_connect_failures , Value=0 },
{ Key=ag_0_5periods_network_errors , Value=0 },
{ Key=ag_0_5periods_wrong_replies , Value=0 },
{ Key=ag_0_5periods_unexpected_closings , Value=0 },
{ Key=ag_0_5periods_warnings , Value=0 },
{ Key=ag_0_5periods_succeeded_queries , Value=146 },
{ Key=ag_0_5periods_msecsperquery , Value=231.83 },
{ Key=ag_1_hostname , Value=192.168.0.202:6714 },
{ Key=ag_1_references , Value=2 },
{ Key=ag_1_lastquery , Value=0.41 },
{ Key=ag_1_lastanswer , Value=0.19 },
{ Key=ag_1_lastperiodmsec , Value=220 },
{ Key=ag_1_errorsarow , Value=0 },
{ Key=ag_1_1periods_query_timeouts , Value=0 },
{ Key=ag_1_1periods_connect_timeouts , Value=0 },
{ Key=ag_1_1periods_connect_failures , Value=0 },
{ Key=ag_1_1periods_network_errors , Value=0 },
{ Key=ag_1_1periods_wrong_replies , Value=0 },
{ Key=ag_1_1periods_unexpected_closings , Value=0 },
{ Key=ag_1_1periods_warnings , Value=0 },
{ Key=ag_1_1periods_succeeded_queries , Value=27 },
{ Key=ag_1_1periods_msecsperquery , Value=231.24 },
{ Key=ag_1_5periods_query_timeouts , Value=0 },
{ Key=ag_1_5periods_connect_timeouts , Value=0 },
{ Key=ag_1_5periods_connect_failures , Value=0 },
{ Key=ag_1_5periods_network_errors , Value=0 },
{ Key=ag_1_5periods_wrong_replies , Value=0 },
{ Key=ag_1_5periods_warnings , Value=0 },
{ Key=ag_1_5periods_succeeded_queries , Value=146 },
{ Key=ag_1_5periods_msecsperquery , Value=230.85 }],
error="" ,
total=0,
warning="" }
res = await utilsApi.sql("SHOW AGENT STATUS");
{
"columns":
[{
"Key":
{
"type": "string"
}
},
{
"Value":
{
"type": "string"
}
}],
"data":
[
{"Key": "status_period_seconds", "Value": "60"},
{"Key": "status_stored_periods", "Value": "15"},
{"Key": "ag_0_hostname", "Value": "192.168.0.202:6713"},
{"Key": "ag_0_references", "Value": "2"},
{"Key": "ag_0_lastquery", "Value": "0.41"},
{"Key": "ag_0_lastanswer", "Value": "0.19"},
{"Key": "ag_0_lastperiodmsec", "Value": "222"},
{"Key": "ag_0_errorsarow", "Value": "0"},
{"Key": "ag_0_1periods_query_timeouts", "Value": "0"},
{"Key": "ag_0_1periods_connect_timeouts", "Value": "0"},
{"Key": "ag_0_1periods_connect_failures", "Value": "0"},
{"Key": "ag_0_1periods_network_errors", "Value": "0"},
{"Key": "ag_0_1periods_wrong_replies", "Value": "0"},
{"Key": "ag_0_1periods_unexpected_closings", "Value": "0"},
{"Key": "ag_0_1periods_warnings", "Value": "0"},
{"Key": "ag_0_1periods_succeeded_queries", "Value": "27"},
{"Key": "ag_0_1periods_msecsperquery", "Value": "232.31"},
{"Key": "ag_0_5periods_query_timeouts", "Value": "0"},
{"Key": "ag_0_5periods_connect_timeouts", "Value": "0"},
{"Key": "ag_0_5periods_connect_failures", "Value": "0"},
{"Key": "ag_0_5periods_network_errors", "Value": "0"},
{"Key": "ag_0_5periods_wrong_replies", "Value": "0"},
{"Key": "ag_0_5periods_unexpected_closings", "Value": "0"},
{"Key": "ag_0_5periods_warnings", "Value": "0"},
{"Key": "ag_0_5periods_succeeded_queries", "Value": "146"},
{"Key": "ag_0_5periods_msecsperquery", "Value": "231.83"},
{"Key": "ag_1_hostname 192.168.0.202:6714"},
{"Key": "ag_1_references", "Value": "2"},
{"Key": "ag_1_lastquery", "Value": "0.41"},
{"Key": "ag_1_lastanswer", "Value": "0.19"},
{"Key": "ag_1_lastperiodmsec", "Value": "220"},
{"Key": "ag_1_errorsarow", "Value": "0"},
{"Key": "ag_1_1periods_query_timeouts", "Value": "0"},
{"Key": "ag_1_1periods_connect_timeouts", "Value": "0"},
{"Key": "ag_1_1periods_connect_failures", "Value": "0"},
{"Key": "ag_1_1periods_network_errors", "Value": "0"},
{"Key": "ag_1_1periods_wrong_replies", "Value": "0"},
{"Key": "ag_1_1periods_unexpected_closings", "Value": "0"},
{"Key": "ag_1_1periods_warnings", "Value": "0"},
{"Key": "ag_1_1periods_succeeded_queries", "Value": "27"},
{"Key": "ag_1_1periods_msecsperquery", "Value": "231.24"},
{"Key": "ag_1_5periods_query_timeouts", "Value": "0"},
{"Key": "ag_1_5periods_connect_timeouts", "Value": "0"},
{"Key": "ag_1_5periods_connect_failures", "Value": "0"},
{"Key": "ag_1_5periods_network_errors", "Value": "0"},
{"Key": "ag_1_5periods_wrong_replies", "Value": "0"},
{"Key": "ag_1_5periods_warnings", "Value": "0"},
{"Key": "ag_1_5periods_succeeded_queries", "Value": "146"},
{"Key": "ag_1_5periods_msecsperquery", "Value": "230.85"}
],
"error": "",
"total": 0,
"warning": ""
}
res := apiClient.UtilsAPI.Sql(context.Background()).Body("SHOW AGENT STATUS").Execute()
{
"columns":
[{
"Key":
{
"type": "string"
}
},
{
"Value":
{
"type": "string"
}
}],
"data":
[
{"Key": "status_period_seconds", "Value": "60"},
{"Key": "status_stored_periods", "Value": "15"},
{"Key": "ag_0_hostname", "Value": "192.168.0.202:6713"},
{"Key": "ag_0_references", "Value": "2"},
{"Key": "ag_0_lastquery", "Value": "0.41"},
{"Key": "ag_0_lastanswer", "Value": "0.19"},
{"Key": "ag_0_lastperiodmsec", "Value": "222"},
{"Key": "ag_0_errorsarow", "Value": "0"},
{"Key": "ag_0_1periods_query_timeouts", "Value": "0"},
{"Key": "ag_0_1periods_connect_timeouts", "Value": "0"},
{"Key": "ag_0_1periods_connect_failures", "Value": "0"},
{"Key": "ag_0_1periods_network_errors", "Value": "0"},
{"Key": "ag_0_1periods_wrong_replies", "Value": "0"},
{"Key": "ag_0_1periods_unexpected_closings", "Value": "0"},
{"Key": "ag_0_1periods_warnings", "Value": "0"},
{"Key": "ag_0_1periods_succeeded_queries", "Value": "27"},
{"Key": "ag_0_1periods_msecsperquery", "Value": "232.31"},
{"Key": "ag_0_5periods_query_timeouts", "Value": "0"},
{"Key": "ag_0_5periods_connect_timeouts", "Value": "0"},
{"Key": "ag_0_5periods_connect_failures", "Value": "0"},
{"Key": "ag_0_5periods_network_errors", "Value": "0"},
{"Key": "ag_0_5periods_wrong_replies", "Value": "0"},
{"Key": "ag_0_5periods_unexpected_closings", "Value": "0"},
{"Key": "ag_0_5periods_warnings", "Value": "0"},
{"Key": "ag_0_5periods_succeeded_queries", "Value": "146"},
{"Key": "ag_0_5periods_msecsperquery", "Value": "231.83"},
{"Key": "ag_1_hostname 192.168.0.202:6714"},
{"Key": "ag_1_references", "Value": "2"},
{"Key": "ag_1_lastquery", "Value": "0.41"},
{"Key": "ag_1_lastanswer", "Value": "0.19"},
{"Key": "ag_1_lastperiodmsec", "Value": "220"},
{"Key": "ag_1_errorsarow", "Value": "0"},
{"Key": "ag_1_1periods_query_timeouts", "Value": "0"},
{"Key": "ag_1_1periods_connect_timeouts", "Value": "0"},
{"Key": "ag_1_1periods_connect_failures", "Value": "0"},
{"Key": "ag_1_1periods_network_errors", "Value": "0"},
{"Key": "ag_1_1periods_wrong_replies", "Value": "0"},
{"Key": "ag_1_1periods_unexpected_closings", "Value": "0"},
{"Key": "ag_1_1periods_warnings", "Value": "0"},
{"Key": "ag_1_1periods_succeeded_queries", "Value": "27"},
{"Key": "ag_1_1periods_msecsperquery", "Value": "231.24"},
{"Key": "ag_1_5periods_query_timeouts", "Value": "0"},
{"Key": "ag_1_5periods_connect_timeouts", "Value": "0"},
{"Key": "ag_1_5periods_connect_failures", "Value": "0"},
{"Key": "ag_1_5periods_network_errors", "Value": "0"},
{"Key": "ag_1_5periods_wrong_replies", "Value": "0"},
{"Key": "ag_1_5periods_warnings", "Value": "0"},
{"Key": "ag_1_5periods_succeeded_queries", "Value": "146"},
{"Key": "ag_1_5periods_msecsperquery", "Value": "230.85"}
],
"error": "",
"total": 0,
"warning": ""
}
An optional LIKE clause is supported, with the syntax being the same as in SHOW STATUS.
SHOW AGENT STATUS LIKE '%5period%msec%';
+-----------------------------+--------+
| Key | Value |
+-----------------------------+--------+
| ag_0_5periods_msecsperquery | 234.72 |
| ag_1_5periods_msecsperquery | 233.73 |
| ag_2_5periods_msecsperquery | 343.81 |
+-----------------------------+--------+
3 rows in set (0.00 sec)
$client->nodes()->agentstatus(
['body'=>
['pattern'=>'%5period%msec%']
]
);
Array(
[ag_0_5periods_msecsperquery] => 234.72
[ag_1_5periods_msecsperquery] => 233.73
[ag_2_5periods_msecsperquery] => 343.81
)
utilsApi.sql('SHOW AGENT STATUS LIKE \'%5period%msec%\'')
{u'columns': [{u'Key': {u'type': u'string'}},
{u'Value': {u'type': u'string'}}],
u'data': [
{u'Key': u'ag_0_5periods_msecsperquery', u'Value': u'234.72'},
{u'Key': u'ag_1_5periods_msecsperquery', u'Value': u'233.73'},
{u'Key': u'ag_2_5periods_msecsperquery', u'Value': u'343.81'}],
u'error': u'',
u'total': 0,
u'warning': u''}
res = await utilsApi.sql("SHOW AGENT STATUS LIKE \"%5period%msec%\"");
{"columns": [{"Key": {"type": "string"}},
{"Value": {"type": "string"}}],
"data": [
{"Key": "ag_0_5periods_msecsperquery", "Value": "234.72"},
{"Key": "ag_1_5periods_msecsperquery", "Value": "233.73"},
{"Key": "ag_2_5periods_msecsperquery", "Value": "343.81"}],
"error": "",
"total": 0,
"warning": ""}
utilsApi.sql("SHOW AGENT STATUS LIKE \"%5period%msec%\"");
{columns: [{Key={type=string}},
{Value={type=string}}],
data: [
{Key=ag_0_5periods_msecsperquery, Value=234.72},
{Key=ag_1_5periods_msecsperquery, Value=233.73},
{Key=ag_2_5periods_msecsperquery, Value=343.81}],
error: ,
total: 0,
warning: }
utilsApi.Sql("SHOW AGENT STATUS LIKE \"%5period%msec%\"");
{columns: [{Key={type=string}},
{Value={type=string}}],
data: [
{Key=ag_0_5periods_msecsperquery, Value=234.72},
{Key=ag_1_5periods_msecsperquery, Value=233.73},
{Key=ag_2_5periods_msecsperquery, Value=343.81}],
error: "",
total: 0,
warning: ""}
res = await utilsApi.sql("SHOW AGENT STATUS LIKE \"%5period%msec%\"");
{
"columns":
[{
"Key": {"type": "string"}
},
{
"Value": {"type": "string"}
}],
"data":
[
{"Key": "ag_0_5periods_msecsperquery", "Value": "234.72"},
{"Key": "ag_1_5periods_msecsperquery", "Value": "233.73"},
{"Key": "ag_2_5periods_msecsperquery", "Value": "343.81"}
],
"error": "",
"total": 0,
"warning": ""
}
apiClient.UtilsAPI.Sql(context.Background()).Body("SHOW AGENT STATUS LIKE \"%5period%msec%\"").Execute()
{
"columns":
[{
"Key": {"type": "string"}
},
{
"Value": {"type": "string"}
}],
"data":
[
{"Key": "ag_0_5periods_msecsperquery", "Value": "234.72"},
{"Key": "ag_1_5periods_msecsperquery", "Value": "233.73"},
{"Key": "ag_2_5periods_msecsperquery", "Value": "343.81"}
],
"error": "",
"total": 0,
"warning": ""
}
You can specify a particular agent by its address. In this case, only that agent's data will be displayed. Additionally, the agent_ prefix will be used instead of ag_N_:
SHOW AGENT '192.168.0.202:6714' STATUS LIKE '%15periods%';
+-------------------------------------+--------+
| Variable_name | Value |
+-------------------------------------+--------+
| agent_15periods_query_timeouts | 0 |
| agent_15periods_connect_timeouts | 0 |
| agent_15periods_connect_failures | 0 |
| agent_15periods_network_errors | 0 |
| agent_15periods_wrong_replies | 0 |
| agent_15periods_unexpected_closings | 0 |
| agent_15periods_warnings | 0 |
| agent_15periods_succeeded_queries | 439 |
| agent_15periods_msecsperquery | 231.73 |
+-------------------------------------+--------+
9 rows in set (0.00 sec)
$client->nodes()->agentstatus(
['body'=>
['agent'=>'192.168.0.202:6714',
'pattern'=>'%15periods%']
]
);
Array(
[agent_15periods_query_timeouts] => 0
[agent_15periods_connect_timeouts] => 0
[agent_15periods_connect_failures] => 0
[agent_15periods_network_errors] => 0
[agent_15periods_wrong_replies] => 0
[agent_15periods_unexpected_closings] => 0
[agent_15periods_warnings] => 0
[agent_15periods_succeeded_queries] => 439
[agent_15periods_msecsperquery] => 231.73
)
utilsApi.sql('SHOW AGENT \'192.168.0.202:6714\' STATUS LIKE \'%15periods%\'')
{u'columns': [{u'Key': {u'type': u'string'}},
{u'Value': {u'type': u'string'}}],
u'data': [
{u'Key': u'agent_15periods_query_timeouts', u'Value': u'0'},
{u'Key': u'agent_15periods_connect_timeouts', u'Value': u'0'},
{u'Key': u'agent_15periods_connect_failures', u'Value': u'0'},
{u'Key': u'agent_15periods_network_errors', u'Value': u'0'},
{u'Key': u'agent_15periods_wrong_replies', u'Value': u'0'},
{u'Key': u'agent_15periods_unexpected_closings', u'Value': u'0'},
{u'Key': u'agent_15periods_warnings', u'Value': u'0'},
{u'Key': u'agent_15periods_succeeded_queries', u'Value': u'439'},
{u'Key': u'agent_15periods_msecsperquery', u'Value': u'233.73'},
],
u'error': u'',
u'total': 0,
u'warning': u''}
res = await utilsApi.sql("SHOW AGENT \"192.168.0.202:6714\" STATUS LIKE \"%15periods%\"");
{"columns": [{"Key": {"type": "string"}},
{"Value": {"type": "string"}}],
"data": [
{"Key": "agent_15periods_query_timeouts", "Value": "0"},
{"Key": "agent_15periods_connect_timeouts", "Value": "0"},
{"Key": "agent_15periods_connect_failures", "Value": "0"},
{"Key": "agent_15periods_network_errors", "Value": "0"},
{"Key": "agent_15periods_connect_failures", "Value": "0"},
{"Key": "agent_15periods_wrong_replies", "Value": "0"},
{"Key": "agent_15periods_unexpected_closings", "Value": "0"},
{"Key": "agent_15periods_warnings", "Value": "0"},
{"Key": "agent_15periods_succeeded_queries", "Value": "439"},
{"Key": "agent_15periods_msecsperquery", "Value": "233.73"},
],
"error": "",
"total": 0,
"warning": ""}
utilsApi.sql("SHOW AGENT \"192.168.0.202:6714\" STATUS LIKE \"%15periods%\"");
{columns=[{Key={type=string}},
{Value={type=string}}],
data=[
{Key=agent_15periods_query_timeouts, Value=0},
{Key=agent_15periods_connect_timeouts, Value=0},
{Key=agent_15periods_connect_failures, Value=0},
{Key=agent_15periods_network_errors, Value=0},
{Key=agent_15periods_wrong_replies, Value=0},
{Key=agent_15periods_unexpected_closings, Value=0},
{Key=agent_15periods_warnings, Value=0},
{Key=agent_15periods_succeeded_queries, Value=439},
{Key=agent_15periods_msecsperquery, Value=233.73},
],
error=,
total=0,
warning=}
utilsApi.Sql("SHOW AGENT \"192.168.0.202:6714\" STATUS LIKE \"%15periods%\"");
{columns=[{Key={type=string}},
{Value={type=string}}],
data=[
{Key=agent_15periods_query_timeouts, Value=0},
{Key=agent_15periods_connect_timeouts, Value=0},
{Key=agent_15periods_connect_failures, Value=0},
{Key=agent_15periods_network_errors, Value=0},
{Key=agent_15periods_wrong_replies, Value=0},
{Key=agent_15periods_unexpected_closings, Value=0},
{Key=agent_15periods_warnings, Value=0},
{Key=agent_15periods_succeeded_queries, Value=439},
{Key=agent_15periods_msecsperquery, Value=233.73},
],
error="",
total=0,
warning=""}
res = await utilsApi.sql("SHOW AGENT \"192.168.0.202:6714\" STATUS LIKE \"%15periods%\"");
{
"columns":
[{
{"Key": {"type": "string"}
},
{"Value": {"type": "string"}
}],
"data":
[
{"Key": "agent_15periods_query_timeouts", "Value": "0"},
{"Key": "agent_15periods_connect_timeouts", "Value": "0"},
{"Key": "agent_15periods_connect_failures", "Value": "0"},
{"Key": "agent_15periods_network_errors", "Value": "0"},
{"Key": "agent_15periods_connect_failures", "Value": "0"},
{"Key": "agent_15periods_wrong_replies", "Value": "0"},
{"Key": "agent_15periods_unexpected_closings", "Value": "0"},
{"Key": "agent_15periods_warnings", "Value": "0"},
{"Key": "agent_15periods_succeeded_queries", "Value": "439"},
{"Key": "agent_15periods_msecsperquery", "Value": "233.73"},
],
"error": "",
"total": 0,
"warning": ""
}
apiClient.UtilsAPI.Sql(context.Background()).Body("SHOW AGENT \"192.168.0.202:6714\" STATUS LIKE \"%15periods%\"").Execute()
{
"columns":
[{
{"Key": {"type": "string"}
},
{"Value": {"type": "string"}
}],
"data":
[
{"Key": "agent_15periods_query_timeouts", "Value": "0"},
{"Key": "agent_15periods_connect_timeouts", "Value": "0"},
{"Key": "agent_15periods_connect_failures", "Value": "0"},
{"Key": "agent_15periods_network_errors", "Value": "0"},
{"Key": "agent_15periods_connect_failures", "Value": "0"},
{"Key": "agent_15periods_wrong_replies", "Value": "0"},
{"Key": "agent_15periods_unexpected_closings", "Value": "0"},
{"Key": "agent_15periods_warnings", "Value": "0"},
{"Key": "agent_15periods_succeeded_queries", "Value": "439"},
{"Key": "agent_15periods_msecsperquery", "Value": "233.73"},
],
"error": "",
"total": 0,
"warning": ""
}
Finally, you can check the status of the agents in a specific distributed table using the SHOW AGENT index_name STATUS statement. This statement displays the table's HA status (i.e., whether or not it uses agent mirrors at all) and provides information on the mirrors, including: address, blackhole and persistent flags, and the mirror selection probability used when one of the weighted probability strategies is in effect.
SHOW AGENT dist_index STATUS;
+--------------------------------------+--------------------------------+
| Variable_name | Value |
+--------------------------------------+--------------------------------+
| dstindex_1_is_ha | 1 |
| dstindex_1mirror1_id | 192.168.0.202:6713:loc |
| dstindex_1mirror1_probability_weight | 0.372864 |
| dstindex_1mirror1_is_blackhole | 0 |
| dstindex_1mirror1_is_persistent | 0 |
| dstindex_1mirror2_id | 192.168.0.202:6714:loc |
| dstindex_1mirror2_probability_weight | 0.374635 |
| dstindex_1mirror2_is_blackhole | 0 |
| dstindex_1mirror2_is_persistent | 0 |
| dstindex_1mirror3_id | dev1.manticoresearch.com:6714:loc |
| dstindex_1mirror3_probability_weight | 0.252501 |
| dstindex_1mirror3_is_blackhole | 0 |
| dstindex_1mirror3_is_persistent | 0 |
+--------------------------------------+--------------------------------+
13 rows in set (0.00 sec)
$client->nodes()->agentstatus(
['body'=>
['agent'=>'dist_index']
]
);
Array(
[dstindex_1_is_ha] => 1
[dstindex_1mirror1_id] => 192.168.0.202:6713:loc
[dstindex_1mirror1_probability_weight] => 0.372864
[dstindex_1mirror1_is_blackhole] => 0
[dstindex_1mirror1_is_persistent] => 0
[dstindex_1mirror2_id] => 192.168.0.202:6714:loc
[dstindex_1mirror2_probability_weight] => 0.374635
[dstindex_1mirror2_is_blackhole] => 0
[dstindex_1mirror2_is_persistent] => 0
[dstindex_1mirror3_id] => dev1.manticoresearch.com:6714:loc
[dstindex_1mirror3_probability_weight] => 0.252501
[dstindex_1mirror3_is_blackhole] => 0
[dstindex_1mirror3_is_persistent] => 0
)
utilsApi.sql('SHOW AGENT dist_index STATUS')
{u'columns': [{u'Key': {u'type': u'string'}},
{u'Value': {u'type': u'string'}}],
u'data': [
{u'Key': u'dstindex_1_is_ha', u'Value': u'1'},
{u'Key': u'dstindex_1mirror1_id', u'Value': u'192.168.0.202:6713:loc'},
{u'Key': u'dstindex_1mirror1_probability_weight', u'Value': u'0.372864'},
{u'Key': u'dstindex_1mirror1_is_blackhole', u'Value': u'0'},
{u'Key': u'dstindex_1mirror1_is_persistent', u'Value': u'0'},
{u'Key': u'dstindex_1mirror2_id', u'Value': u'192.168.0.202:6714:loc'},
{u'Key': u'dstindex_1mirror2_probability_weight', u'Value': u'0.374635'},
{u'Key': u'dstindex_1mirror2_is_blackhole', u'Value': u'0'},
{u'Key': u'dstindex_1mirror2_is_persistent', u'Value': u'0'},
{u'Key': u'dstindex_1mirror3_id', u'Value': u'dev1.manticoresearch.com:6714:loc'},
{u'Key': u'dstindex_1mirror3_probability_weight', u'Value': u'0.252501'},
{u'Key': u'dstindex_1mirror3_is_blackhole', u'Value': u'0'},
{u'Key': u'dstindex_1mirror3_is_persistent', u'Value': u'0'}
],
u'error': u'',
u'total': 0,
u'warning': u''}
res = await utilsApi.sql("SHOW AGENT \"192.168.0.202:6714\" STATUS LIKE \"%15periods%\"");
{"columns": [{"Key": {"type": "string"}},
{"Value": {"type": "string"}}],
"data": [
{"Key": "dstindex_1_is_ha", "Value": "1"},
{"Key": "dstindex_1mirror1_id", "Value": "192.168.0.202:6713:loc"},
{"Key": "dstindex_1mirror1_probability_weight", "Value": "0.372864"},
{"Key": "dstindex_1mirror1_is_blackhole", "Value": "0"},
{"Key": "dstindex_1mirror1_is_persistent", "Value": "0"},
{"Key": "dstindex_1mirror2_id", "Value": "192.168.0.202:6714:loc"},
{"Key": "dstindex_1mirror2_probability_weight", "Value": "0.374635"},
{"Key": "dstindex_1mirror2_is_blackhole", "Value": "0"},
{"Key": "dstindex_1mirror2_is_persistent", "Value": "439"},
{"Key": "dstindex_1mirror3_id", "Value": "dev1.manticoresearch.com:6714:loc"},
{"Key": "dstindex_1mirror3_probability_weight", "Value": " 0.252501"},
{"Key": "dstindex_1mirror3_is_blackhole", "Value": "0"},
{"Key": "dstindex_1mirror3_is_persistent", "Value": "439"}
],
"error": "",
"total": 0,
"warning": ""}
utilsApi.sql("SHOW AGENT \"192.168.0.202:6714\" STATUS LIKE \"%15periods%\"");
{columns=[{Key={type=string}},
{Value={type=string}}],
data=[
{Key=dstindex_1_is_ha, Value=1},
{Key=dstindex_1mirror1_id, Value=192.168.0.202:6713:loc},
{Key=dstindex_1mirror1_probability_weight, Value=0.372864},
{Key=dstindex_1mirror1_is_blackhole, Value=0},
{Key=dstindex_1mirror1_is_persistent, Value=0},
{Key=dstindex_1mirror2_id, Value=192.168.0.202:6714:loc},
{Key=dstindex_1mirror2_probability_weight, Value=0.374635},
{Key=dstindex_1mirror2_is_blackhole, Value=0},
{Key=dstindex_1mirror2_is_persistent, Value=0},
{Key=dstindex_1mirror3_id, Value=dev1.manticoresearch.com:6714:loc},
{Key=dstindex_1mirror3_probability_weight, Value=0.252501},
{Key=dstindex_1mirror3_is_blackhole, Value=0},
{Key=dstindex_1mirror3_is_persistent, Value=0}
],
error=,
total=0,
warning=}
utilsApi.Sql("SHOW AGENT \"192.168.0.202:6714\" STATUS LIKE \"%15periods%\"");
{columns=[{Key={type=string}},
{Value={type=string}}],
data=[
{Key=dstindex_1_is_ha, Value=1},
{Key=dstindex_1mirror1_id, Value=192.168.0.202:6713:loc},
{Key=dstindex_1mirror1_probability_weight, Value=0.372864},
{Key=dstindex_1mirror1_is_blackhole, Value=0},
{Key=dstindex_1mirror1_is_persistent, Value=0},
{Key=dstindex_1mirror2_id, Value=192.168.0.202:6714:loc},
{Key=dstindex_1mirror2_probability_weight, Value=0.374635},
{Key=dstindex_1mirror2_is_blackhole, Value=0},
{Key=dstindex_1mirror2_is_persistent, Value=0},
{Key=dstindex_1mirror3_id, Value=dev1.manticoresearch.com:6714:loc},
{Key=dstindex_1mirror3_probability_weight, Value=0.252501},
{Key=dstindex_1mirror3_is_blackhole, Value=0},
{Key=dstindex_1mirror3_is_persistent, Value=0}
],
error="",
total=0,
warning=""}
res = await utilsApi.sql("SHOW AGENT \"192.168.0.202:6714\" STATUS LIKE \"%15periods%\"");
{
"columns":
[{
"Key": {"type": "string"}},
{"Value": {"type": "string"}
}],
"data":
[
{"Key": "dstindex_1_is_ha", "Value": "1"},
{"Key": "dstindex_1mirror1_id", "Value": "192.168.0.202:6713:loc"},
{"Key": "dstindex_1mirror1_probability_weight", "Value": "0.372864"},
{"Key": "dstindex_1mirror1_is_blackhole", "Value": "0"},
{"Key": "dstindex_1mirror1_is_persistent", "Value": "0"},
{"Key": "dstindex_1mirror2_id", "Value": "192.168.0.202:6714:loc"},
{"Key": "dstindex_1mirror2_probability_weight", "Value": "0.374635"},
{"Key": "dstindex_1mirror2_is_blackhole", "Value": "0"},
{"Key": "dstindex_1mirror2_is_persistent", "Value": "439"},
{"Key": "dstindex_1mirror3_id", "Value": "dev1.manticoresearch.com:6714:loc"},
{"Key": "dstindex_1mirror3_probability_weight", "Value": " 0.252501"},
{"Key": "dstindex_1mirror3_is_blackhole", "Value": "0"},
{"Key": "dstindex_1mirror3_is_persistent", "Value": "439"}
],
"error": "",
"total": 0,
"warning": ""
}
apiClient.UtilsAPI.Sql(context.Background()).Body("SHOW AGENT dist_index STATUS").Execute()
{
"columns":
[{
"Key": {"type": "string"}},
{"Value": {"type": "string"}
}],
"data":
[
{"Key": "dstindex_1_is_ha", "Value": "1"},
{"Key": "dstindex_1mirror1_id", "Value": "192.168.0.202:6713:loc"},
{"Key": "dstindex_1mirror1_probability_weight", "Value": "0.372864"},
{"Key": "dstindex_1mirror1_is_blackhole", "Value": "0"},
{"Key": "dstindex_1mirror1_is_persistent", "Value": "0"},
{"Key": "dstindex_1mirror2_id", "Value": "192.168.0.202:6714:loc"},
{"Key": "dstindex_1mirror2_probability_weight", "Value": "0.374635"},
{"Key": "dstindex_1mirror2_is_blackhole", "Value": "0"},
{"Key": "dstindex_1mirror2_is_persistent", "Value": "439"},
{"Key": "dstindex_1mirror3_id", "Value": "dev1.manticoresearch.com:6714:loc"},
{"Key": "dstindex_1mirror3_probability_weight", "Value": " 0.252501"},
{"Key": "dstindex_1mirror3_is_blackhole", "Value": "0"},
{"Key": "dstindex_1mirror3_is_persistent", "Value": "439"}
],
"error": "",
"total": 0,
"warning": ""
}
SHOW META is an SQL statement that displays additional meta-information about the processed query, including the query time, keyword statistics, and information about the secondary indexes used. The syntax is:
SHOW META [ LIKE pattern ]
The included items are:
- total: The number of matches actually retrieved and sent to the client.
- total_found: The estimated total number of matches for the query in the index.
- total_relation: If Manticore cannot calculate the exact total value, this field will display total_relation: gte, indicating that the actual count is Greater Than or Equal to total_found. If the total value is precise, total_relation: eq will be shown.
- time: The duration (in seconds) it took to process the search query.
- keyword[N]: The n-th keyword used in the search query. Note that the keyword can be presented as a wildcard, e.g., abc*.
- docs[N]: The total number of documents (or records) containing the n-th keyword from the search query. If the keyword is presented as a wildcard, this value represents the sum of documents for all expanded sub-keywords, potentially exceeding the actual number of matched documents.
- hits[N]: The total number of occurrences (or hits) of the n-th keyword across all documents.
- index: Information about the utilized index (e.g., secondary index).

SELECT id, story_author FROM hn_small WHERE MATCH('one|two|three') and comment_ranking > 2 limit 5;
show meta;
+---------+--------------+
| id | story_author |
+---------+--------------+
| 151171 | anewkid |
| 302758 | bks |
| 805806 | drRoflol |
| 1099245 | tnorthcutt |
| 303252 | whiten |
+---------+--------------+
5 rows in set (0.00 sec)
+----------------+---------------------------------------+
| Variable_name | Value |
+----------------+---------------------------------------+
| total | 5 |
| total_found | 2308 |
| total_relation | eq |
| time | 0.001 |
| keyword[0] | one |
| docs[0] | 224387 |
| hits[0] | 310327 |
| keyword[1] | three |
| docs[1] | 18181 |
| hits[1] | 21102 |
| keyword[2] | two |
| docs[2] | 63251 |
| hits[2] | 75961 |
| index | comment_ranking:SecondaryIndex (100%) |
+----------------+---------------------------------------+
14 rows in set (0.00 sec)
SHOW META can display I/O and CPU counters, but they will only be available if searchd was started with the --iostats and --cpustats switches, respectively.
SELECT id, story_author FROM hn_small WHERE MATCH('one|two|three') limit 5;
SHOW META;
+--------+--------------+
| id | story_author |
+--------+--------------+
| 300263 | throwaway37 |
| 713503 | mahmud |
| 716804 | mahmud |
| 776906 | jimbokun |
| 753332 | foxhop |
+--------+--------------+
5 rows in set (0.01 sec)
+-----------------------+--------+
| Variable_name | Value |
+-----------------------+--------+
| total | 5 |
| total_found | 266385 |
| total_relation | eq |
| time | 0.011 |
| cpu_time | 18.004 |
| agents_cpu_time | 0.000 |
| io_read_time | 0.000 |
| io_read_ops | 0 |
| io_read_kbytes | 0.0 |
| io_write_time | 0.000 |
| io_write_ops | 0 |
| io_write_kbytes | 0.0 |
| agent_io_read_time | 0.000 |
| agent_io_read_ops | 0 |
| agent_io_read_kbytes | 0.0 |
| agent_io_write_time | 0.000 |
| agent_io_write_ops | 0 |
| agent_io_write_kbytes | 0.0 |
| keyword[0] | one |
| docs[0] | 224387 |
| hits[0] | 310327 |
| keyword[1] | three |
| docs[1] | 18181 |
| hits[1] | 21102 |
| keyword[2] | two |
| docs[2] | 63251 |
| hits[2] | 75961 |
+-----------------------+--------+
27 rows in set (0.00 sec)
Additional values, such as predicted_time, dist_predicted_time, local_fetched_docs, local_fetched_hits, local_fetched_skips, and their respective dist_fetched_* counterparts, will only be available if searchd was configured with predicted time costs and the query included max_predicted_time in the OPTION clause.
SELECT id,story_author FROM hn_small WHERE MATCH('one|two|three') limit 5 option max_predicted_time=100;
SHOW META;
+--------+--------------+
| id | story_author |
+--------+--------------+
| 300263 | throwaway37 |
| 713503 | mahmud |
| 716804 | mahmud |
| 776906 | jimbokun |
| 753332 | foxhop |
+--------+--------------+
5 rows in set (0.01 sec)
mysql> show meta;
+---------------------+--------+
| Variable_name | Value |
+---------------------+--------+
| total | 5 |
| total_found | 266385 |
| total_relation | eq |
| time | 0.012 |
| local_fetched_docs | 307212 |
| local_fetched_hits | 407390 |
| local_fetched_skips | 24 |
| predicted_time | 56 |
| keyword[0] | one |
| docs[0] | 224387 |
| hits[0] | 310327 |
| keyword[1] | three |
| docs[1] | 18181 |
| hits[1] | 21102 |
| keyword[2] | two |
| docs[2] | 63251 |
| hits[2] | 75961 |
+---------------------+--------+
17 rows in set (0.00 sec)
SHOW META must be executed immediately after the query in the same session. Since some MySQL connectors/libraries use connection pools, running SHOW META as a separate statement can lead to unexpected results, such as retrieving metadata from another query. In these cases (and as a general recommendation), run a multi-statement containing both the query and SHOW META. Some connectors/libraries support multi-queries via the same method used for a single statement, while others may require a dedicated multi-query method or specific options during connection setup.
SELECT id,story_author FROM hn_small WHERE MATCH('one|two|three') LIMIT 5; SHOW META;
+--------+--------------+
| id | story_author |
+--------+--------------+
| 300263 | throwaway37 |
| 713503 | mahmud |
| 716804 | mahmud |
| 776906 | jimbokun |
| 753332 | foxhop |
+--------+--------------+
5 rows in set (0.01 sec)
+----------------+--------+
| Variable_name | Value |
+----------------+--------+
| total | 5 |
| total_found | 266385 |
| total_relation | eq |
| time | 0.011 |
| keyword[0] | one |
| docs[0] | 224387 |
| hits[0] | 310327 |
| keyword[1] | three |
| docs[1] | 18181 |
| hits[1] | 21102 |
| keyword[2] | two |
| docs[2] | 63251 |
| hits[2] | 75961 |
+----------------+--------+
13 rows in set (0.00 sec)
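For example, with the third-party pymysql driver the query and SHOW META can be sent as one multi-statement and both result sets read back over the same connection. This is only a minimal sketch: the driver choice, host, port, and empty credentials are assumptions, and any connector with multi-statement support works similarly.
import pymysql
from pymysql.constants import CLIENT

# Assumption: Manticore is listening for the MySQL protocol on the default port 9306
conn = pymysql.connect(host='127.0.0.1', port=9306, user='', client_flag=CLIENT.MULTI_STATEMENTS)
with conn.cursor() as cur:
    # One multi-statement keeps the SELECT and SHOW META on the same connection
    cur.execute("SELECT id, story_author FROM hn_small WHERE MATCH('one|two|three') LIMIT 5; SHOW META")
    rows = cur.fetchall()          # result set of the SELECT
    cur.nextset()                  # advance to the SHOW META result set
    meta = dict(cur.fetchall())    # (Variable_name, Value) pairs
print(rows)
print(meta.get('total_found'))
conn.close()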
You can also use the optional LIKE clause, which allows you to select only the variables that match a specific pattern. The pattern syntax follows standard SQL wildcards, where % represents any number of any characters, and _ represents a single character.
SHOW META LIKE 'total%';
+----------------+--------+
| Variable_name | Value |
+----------------+--------+
| total | 5 |
| total_found | 266385 |
| total_relation | eq |
+----------------+--------+
3 rows in set (0.00 sec)
When utilizing faceted search, you can examine the multiplier field in the SHOW META output to determine how many queries were executed in an optimized group.
SELECT * FROM facetdemo FACET brand_id FACET price FACET categories;
SHOW META LIKE 'multiplier';
+------+-------+----------+---------------------+-------------+-------------+---------------------------------------+------------+
| id | price | brand_id | title | brand_name | property | j | categories |
+------+-------+----------+---------------------+-------------+-------------+---------------------------------------+------------+
| 1 | 306 | 1 | Product Ten Three | Brand One | Six_Ten | {"prop1":66,"prop2":91,"prop3":"One"} | 10,11 |
...
+----------+----------+
| brand_id | count(*) |
+----------+----------+
| 1 | 1013 |
...
+-------+----------+
| price | count(*) |
+-------+----------+
| 306 | 7 |
...
+------------+----------+
| categories | count(*) |
+------------+----------+
| 10 | 2436 |
...
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| multiplier | 4 |
+---------------+-------+
1 row in set (0.00 sec)
When the cost-based query optimizer chooses to use DocidIndex, ColumnarScan, or SecondaryIndex instead of a plain filter, this is reflected in the SHOW META command.
The index variable displays the names and types of secondary indexes used during query execution. The percentage indicates how many disk chunks (in the case of an RT table) or pseudo shards (in the case of a plain table) utilized the secondary index.
SELECT count(*) FROM taxi1 WHERE tip_amount = 5;
SHOW META;
+----------------+----------------------------------+
| Variable_name | Value |
+----------------+----------------------------------+
| total | 1 |
| total_found | 1 |
| total_relation | eq |
| time | 0.016 |
| index | tip_amount:SecondaryIndex (100%) |
+----------------+----------------------------------+
5 rows in set (0.00 sec)
SHOW META can be used after executing a CALL PQ statement, in which case it provides different output.
SHOW META following a CALL PQ statement includes:
- Total - Total time spent on matching the document(s)
- Queries matched - Number of stored queries that match the document(s)
- Document matches - Number of documents that matched the queries stored in the table
- Total queries stored - Total number of queries stored in the table
- Term only queries - Number of queries in the table that have terms; the remaining queries use extended query syntax

CALL PQ ('pq', ('{"title":"angry", "gid":3 }')); SHOW META;
+------+
| id |
+------+
| 2 |
+------+
1 row in set (0.00 sec)
+-----------------------+-----------+
| Name | Value |
+-----------------------+-----------+
| Total | 0.000 sec |
| Queries matched | 1 |
| Queries failed | 0 |
| Document matched | 1 |
| Total queries stored | 2 |
| Term only queries | 2 |
| Fast rejected queries | 1 |
+-----------------------+-----------+
7 rows in set (0.00 sec)
Using CALL PQ with a verbose option provides more detailed output.
It includes the following additional entries:
- Setup - Time spent on the initial setup of the matching process, such as parsing docs and setting options
- Queries failed - Number of queries that failed
- Fast rejected queries - Number of queries that were not fully evaluated but quickly matched and rejected using filters or other conditions
- Time per query - Detailed time for each query
- Time of matched queries - Total time spent on queries that matched any documents

CALL PQ ('pq', ('{"title":"angry", "gid":3 }'), 1 as verbose); SHOW META;
+------+
| id |
+------+
| 2 |
+------+
1 row in set (0.00 sec)
+-------------------------+-----------+
| Name | Value |
+-------------------------+-----------+
| Total | 0.000 sec |
| Setup | 0.000 sec |
| Queries matched | 1 |
| Queries failed | 0 |
| Document matched | 1 |
| Total queries stored | 2 |
| Term only queries | 2 |
| Fast rejected queries | 1 |
| Time per query | 69 |
| Time of matched queries | 69 |
+-------------------------+-----------+
10 rows in set (0.00 sec)
SHOW THREADS [ OPTION columns=width[,format=sphinxql][,format=all] ]
SHOW THREADS is an SQL statement that displays information about all threads and their current activities.
The resulting table contains the following columns:
- TID: ID assigned to the thread by the kernel
- Name: Thread name, also visible in top, htop, ps, and other process-viewing tools
- Proto: Connection protocol; possible values include sphinx, mysql, http, ssl, compressed, replication, or a combination (e.g., http,ssl or compressed,mysql)
- State: Thread state; possible values are handshake, net_read, net_write, query, net_idle
- Connection from: Client's ip:port
- ConnID: Connection ID (starting from 0)
- This/prev job time: When the thread is busy - how long the current job has been running; when the thread is idling - the previous job duration with the prev suffix
- Jobs done: Number of jobs completed by this thread
- Thread status: idling or working
- Info: Information about the query, which may include multiple queries if the query targets a distributed table or a real-time table

SHOW THREADS;
*************************** 1. row ***************************
TID: 83
Name: work_1
Proto: mysql
State: query
Connection from: 172.17.0.1:43300
ConnID: 8
This/prev job time: 630us
CPU activity: 94.15%
Jobs done: 2490
Thread status: working
Info: SHOW THREADS
*************************** 2. row ***************************
TID: 84
Name: work_2
Proto: mysql
State: query
Connection from: 172.17.0.1:43301
ConnID: 9
This/prev job time: 689us
CPU activity: 89.23%
Jobs done: 1830
Thread status: working
Info: show threads
POST /cli -d "SHOW THREADS"
+--------+---------+-------+-------+-----------------+--------+-----------------------+-----------+---------------+--------------+
| TID | Name | Proto | State | Connection from | ConnID | This/prev job time, s | Jobs done | Thread status | Info |
+--------+---------+-------+-------+-----------------+--------+-----------------------+-----------+---------------+--------------+
| 501494 | work_23 | http | query | 127.0.0.1:41300 | 1473 | 249us | 1681 | working | show_threads |
+--------+---------+-------+-------+-----------------+--------+-----------------------+-----------+---------------+--------------+
require_once __DIR__ . '/vendor/autoload.php';
$config = ['host'=>'127.0.0.1','port'=>9308];
$client = new \Manticoresearch\Client($config);
print_r($client->nodes()->threads());
Array
(
[0] => Array
(
[TID] => 506960
[Name] => work_8
[Proto] => http
[State] => query
[Connection from] => 127.0.0.1:38072
[ConnID] => 17
[This/prev job time, s] => 231us
[CPU activity] => 93.54%
[Jobs done] => 8
[Thread status] => working
[Info] => show_threads
)
)
import manticoresearch
config = manticoresearch.Configuration(
host = "http://127.0.0.1:9308"
)
client = manticoresearch.ApiClient(config)
utilsApi = manticoresearch.UtilsApi(client)
print(utilsApi.sql('SHOW THREADS'))
[{'columns': [{'TID': {'type': 'long'}}, {'Name': {'type': 'string'}}, {'Proto': {'type': 'string'}}, {'State': {'type': 'string'}}, {'Connection from': {'type': 'string'}}, {'ConnID': {'type': 'long long'}}, {'This/prev job time, s': {'type': 'string'}}, {'CPU activity': {'type': 'float'}}, {'Jobs done': {'type': 'long'}}, {'Thread status': {'type': 'string'}}, {'Info': {'type': 'string'}}], 'data': [{'TID': 506958, 'Name': 'work_6', 'Proto': 'http', 'State': 'query', 'Connection from': '127.0.0.1:38600', 'ConnID': 834, 'This/prev job time, s': '206us', 'CPU activity': '91.85%', 'Jobs done': 943, 'Thread status': 'working', 'Info': 'show_threads'}], 'total': 1, 'error': '', 'warning': ''}]
var Manticoresearch = require('manticoresearch');
var utilsApi = new Manticoresearch.UtilsApi();
async function showThreads() {
res = await utilsApi.sql('SHOW THREADS');
console.log(JSON.stringify(res, null, 4));
}
showThreads();
[
{
"columns": [
{
"TID": {
"type": "long"
}
},
{
"Name": {
"type": "string"
}
},
{
"Proto": {
"type": "string"
}
},
{
"State": {
"type": "string"
}
},
{
"Connection from": {
"type": "string"
}
},
{
"ConnID": {
"type": "long long"
}
},
{
"This/prev job time, s": {
"type": "string"
}
},
{
"CPU activity": {
"type": "float"
}
},
{
"Jobs done": {
"type": "long"
}
},
{
"Thread status": {
"type": "string"
}
},
{
"Info": {
"type": "string"
}
}
],
"data": [
{
"TID": 506964,
"Name": "work_12",
"Proto": "http",
"State": "query",
"Connection from": "127.0.0.1:36656",
"ConnID": 2884,
"This/prev job time, s": "236us",
"CPU activity": "91.73%",
"Jobs done": 3328,
"Thread status": "working",
"Info": "show_threads"
}
],
"total": 1,
"error": "",
"warning": ""
}
]
utilsApi.sql("SHOW THREADS");
{
columns=[
{
TID={
type=string
}
},
{
Name={
type=string
}
},
{
Proto={
type=string
}
},
{
State={
type=string
}
},
{
Connection from={
type=string
}
},
{
ConnID={
type=string
}
},
{
This/prev job time={
type=string
}
},
{
CPU activity={
type=string
}
},
{
Jobs done={
type=string
}
},
{
Thread status={
type=string
}
},
{
Info={
type=string
}
}
],
data=[
{
TID=82,
Name=work_0,
Proto=http,
State=query,
Connection from=172.17.0.1:60550,
ConnID=163,
This/prev job time=105us,
CPU activity=44.68%,
Jobs done=849,
Thread status=working,
Info=show_threads
}
],
total=0,
error=,
warning=
}
utilsApi.Sql("SHOW THREADS");
{
columns=[
{
TID={
type=string
}
},
{
Name={
type=string
}
},
{
Proto={
type=string
}
},
{
State={
type=string
}
},
{
Connection from={
type=string
}
},
{
ConnID={
type=string
}
},
{
This/prev job time= {
type=string
}
},
{
Jobs done={
type=string
}
},
{
Thread status={
type=string
}
},
{
Info={
type=string
}
}
],
data=[
{
TID=83,
Name=work_1,
Proto=http,
State=query,
Connection from=172.17.0.1:41410,
ConnID=6,
This/prev job time=689us,
Jobs done=159,
Thread status=working,
Info=show_threads
}
],
total=0,
error="",
warning=""
}
res = await utilsApi.sql('SHOW THREADS');
[
{
"columns": [
{
"TID": {
"type": "long"
}
},
{
"Name": {
"type": "string"
}
},
{
"Proto": {
"type": "string"
}
},
{
"State": {
"type": "string"
}
},
{
"Connection from": {
"type": "string"
}
},
{
"ConnID": {
"type": "long long"
}
},
{
"This/prev job time, s": {
"type": "string"
}
},
{
"CPU activity": {
"type": "float"
}
},
{
"Jobs done": {
"type": "long"
}
},
{
"Thread status": {
"type": "string"
}
},
{
"Info": {
"type": "string"
}
}
],
"data": [
{
"TID": 506964,
"Name": "work_12",
"Proto": "http",
"State": "query",
"Connection from": "127.0.0.1:36656",
"ConnID": 2884,
"This/prev job time, s": "236us",
"CPU activity": "91.73%",
"Jobs done": 3328,
"Thread status": "working",
"Info": "show_threads"
}
],
"total": 1,
"error": "",
"warning": ""
}
]
apiClient.UtilsAPI.Sql(context.Background()).Body("SHOW THREADS").Execute()
[
{
"columns": [
{
"TID": {
"type": "long"
}
},
{
"Name": {
"type": "string"
}
},
{
"Proto": {
"type": "string"
}
},
{
"State": {
"type": "string"
}
},
{
"Connection from": {
"type": "string"
}
},
{
"ConnID": {
"type": "long long"
}
},
{
"This/prev job time, s": {
"type": "string"
}
},
{
"CPU activity": {
"type": "float"
}
},
{
"Jobs done": {
"type": "long"
}
},
{
"Thread status": {
"type": "string"
}
},
{
"Info": {
"type": "string"
}
}
],
"data": [
{
"TID": 506964,
"Name": "work_12",
"Proto": "http",
"State": "query",
"Connection from": "127.0.0.1:36656",
"ConnID": 2884,
"This/prev job time, s": "236us",
"CPU activity": "91.73%",
"Jobs done": 3328,
"Thread status": "working",
"Info": "show_threads"
}
],
"total": 1,
"error": "",
"warning": ""
}
]
The Info column displays:
You can limit the maximum width of the Info column by specifying the columns=N option.
By default, queries are displayed in their original format. However, when the format=sphinxql option is used, queries will be shown in SQL format, regardless of the protocol used for execution.
Using format=all shows all threads; without this option, idling and system threads (e.g., those busy with OPTIMIZE) are hidden.
SHOW THREADS OPTION columns=30\G
POST /cli -d "SHOW THREADS OPTION columns=30"
$client->nodes()->threads(['body'=>['columns'=>30]]);
utilsApi.sql('SHOW THREADS OPTION columns=30')
res = await utilsApi.sql('SHOW THREADS OPTION columns=30');
utilsApi.sql("SHOW THREADS OPTION columns=30");
utilsApi.Sql("SHOW THREADS OPTION columns=30");
res = await utilsApi.sql('SHOW THREADS OPTION columns=30');
apiClient.UtilsAPI.Sql(context.Background()).Body("SHOW THREADS OPTION columns=30").Execute()
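The options can also be combined according to the syntax above, for example (a short sketch reusing the Python client objects from the earlier SHOW THREADS examples):
print(utilsApi.sql('SHOW THREADS OPTION columns=30,format=sphinxql'))
print(utilsApi.sql('SHOW THREADS OPTION format=all'))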
SHOW QUERIES
SHOW QUERIES returns information about all currently running queries. The output is a table with the following structure:
- id: Query ID that can be used in KILL to terminate the query
- query: Query statement or a portion of it
- time: Time taken on command execution or how long ago the query was performed
- protocol: Connection protocol, with possible values being sphinx, mysql, http, ssl, compressed, replication, or a combination (e.g., http,ssl or compressed,mysql)
- host: Client's ip:port

mysql> SHOW QUERIES;
+------+--------------+---------+----------+-----------------+
| id | query | time | protocol | host |
+------+--------------+---------+----------+-----------------+
| 111 | select | 5ms ago | http | 127.0.0.1:58986 |
| 96 | SHOW QUERIES | 255us | mysql | 127.0.0.1:33616 |
+------+--------------+---------+----------+-----------------+
2 rows in set (0.61 sec)
Refer to SHOW THREADS if you'd like to gain insight from the perspective of the threads themselves.
SHOW VERSION
SHOW VERSION provides detailed version information of various components of the Manticore Search instance. This command is particularly useful for administrators and developers who need to verify the version of Manticore Search they are running, along with the versions of its associated components.
The output table includes two columns:
- Component: This column names the specific component of Manticore Search.
- Version: This column displays the version information for the respective component.
mysql> SHOW VERSION;
+-----------+--------------------------------+
| Component | Version |
+-----------+--------------------------------+
| Daemon | 6.2.13 61cfe38d2@24011520 dev |
| Columnar | columnar 2.2.5 214ce90@240115 |
| Secondary | secondary 2.2.5 214ce90@240115 |
| KNN | knn 2.2.5 214ce90@240115 |
| Buddy | buddy v2.0.11 |
+-----------+--------------------------------+
KILL <query id>
KILL terminates the execution of a query by its ID, which you can find in SHOW QUERIES.
mysql> KILL 4;
Query OK, 1 row affected (0.00 sec)
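A small script can combine the two statements to find and terminate queries programmatically. The sketch below reuses the Python client from the earlier examples and assumes the response has the same shape as the SHOW THREADS output shown above (a list whose first element contains a data array); the selection criterion is purely illustrative:
res = utilsApi.sql('SHOW QUERIES')
for row in res[0]['data']:
    # Illustrative criterion: terminate queries that arrived over the HTTP protocol
    if row['protocol'] == 'http':
        utilsApi.sql('KILL {}'.format(row['id']))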
The SHOW WARNINGS statement can be used to retrieve the warning produced by the latest query. The warning message will be returned along with the query itself:
mysql> SELECT * FROM test1 WHERE MATCH('@@title hello') \G
ERROR 1064 (42000): index test1: syntax error, unexpected TOK_FIELDLIMIT
near '@title hello'
mysql> SELECT * FROM test1 WHERE MATCH('@title -hello') \G
ERROR 1064 (42000): index test1: query is non-computable (single NOT operator)
mysql> SELECT * FROM test1 WHERE MATCH('"test doc"/3') \G
*************************** 1. row ***************************
id: 4
weight: 2500
group_id: 2
date_added: 1231721236
1 row in set, 1 warning (0.00 sec)
mysql> SHOW WARNINGS \G
*************************** 1. row ***************************
Level: warning
Code: 1000
Message: quorum threshold too high (words=2, thresh=3); replacing quorum operator
with AND operator
1 row in set (0.00 sec)
SHOW [{GLOBAL | SESSION}] VARIABLES [LIKE 'pattern']
SHOW VARIABLES returns the current values of a few server-wide variables. The GLOBAL and SESSION clauses are also supported.
mysql> SHOW GLOBAL VARIABLES;
+--------------------------+-----------+
| Variable_name | Value |
+--------------------------+-----------+
| autocommit | 1 |
| collation_connection | libc_ci |
| query_log_format | sphinxql |
| log_level | info |
| max_allowed_packet | 134217728 |
| character_set_client | utf8 |
| character_set_connection | utf8 |
| grouping_in_utc | 0 |
| last_insert_id | 123, 200 |
+--------------------------+-----------+
9 rows in set (0.00 sec)
mysql> show variables like '%log%';
+------------------+----------+
| Variable_name | Value |
+------------------+----------+
| query_log_format | sphinxql |
| log_level | info |
+------------------+----------+
2 rows in set (0.00 sec)
mysql> show session variables like 'autocommit';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| autocommit | 0 |
+---------------+-------+
1 row in set (0.00 sec)
The SQL SHOW PROFILE statement and the "profile": true JSON interface option both provide a detailed execution profile of the executed query. In the case of SQL, profiling must be enabled in the current session before running the statement to be instrumented. This can be accomplished with the SET profiling=1 statement. By default, profiling is disabled to prevent potential performance implications, resulting in an empty profile if not enabled.
Each profiling result includes the following fields:
- Status column briefly describes the specific state where the time was spent. See below.
- Duration column shows the wall clock time, in seconds.
- Switches column displays the number of times the query engine changed to the given state. These are merely logical engine state switches and not any OS level context switches or function calls (although some sections might actually map to function calls), and they do not have any direct effect on performance. In a sense, the number of switches is just the number of times the respective instrumentation point was hit.
- Percent column shows the percentage of time spent in this state.
States in the profile are returned in a prerecorded order that roughly maps (but is not identical) to the actual query order.
The list of states may (and will) change over time as we refine the states. Here's a brief description of the currently profiled states.
- unknown: generic catch-all state. Accounts for not-yet-instrumented code or small miscellaneous tasks that don't really belong in any other state but are too small to warrant their own state.
- net_read: reading the query from the network (i.e., the application).
- io: generic file IO time.
- dist_connect: connecting to remote agents in the distributed table case.
- sql_parse: parsing the SQL syntax.
- dict_setup: dictionary and tokenizer setup.
- parse: parsing the full-text query syntax.
- transforms: full-text query transformations (wildcard and other expansions, simplification, etc.).
- init: initializing the query evaluation.
- open: opening the table files.
- read_docs: IO time spent reading document lists.
- read_hits: IO time spent reading keyword positions.
- get_docs: computing the matching documents.
- get_hits: computing the matching positions.
- filter: filtering the full-text matches.
- rank: computing the relevance rank.
- sort: sorting the matches.
- finalize: finalizing the per-table search result set (last stage expressions, etc.).
- dist_wait: waiting for remote results from agents in the distributed table case.
- aggregate: aggregating multiple result sets.
- net_write: writing the result set to the network.

SET profiling=1;
SELECT id FROM forum WHERE MATCH('the best') LIMIT 1;
SHOW PROFILE;
Query OK, 0 rows affected (0.00 sec)
+--------+
| id |
+--------+
| 241629 |
+--------+
1 row in set (0.35 sec)
+--------------+----------+----------+---------+
| Status | Duration | Switches | Percent |
+--------------+----------+----------+---------+
| unknown | 0.000557 | 5 | 0.16 |
| net_read | 0.000016 | 1 | 0.00 |
| local_search | 0.000076 | 1 | 0.02 |
| sql_parse | 0.000121 | 1 | 0.03 |
| dict_setup | 0.000003 | 1 | 0.00 |
| parse | 0.000072 | 1 | 0.02 |
| transforms | 0.000331 | 2 | 0.10 |
| init | 0.001945 | 3 | 0.56 |
| read_docs | 0.001257 | 76 | 0.36 |
| read_hits | 0.002598 | 186 | 0.75 |
| get_docs | 0.089328 | 2691 | 25.80 |
| get_hits | 0.189626 | 2799 | 54.78 |
| filter | 0.009369 | 2613 | 2.71 |
| rank | 0.029669 | 7760 | 8.57 |
| sort | 0.019070 | 2531 | 5.51 |
| finalize | 0.000001 | 1 | 0.00 |
| clone_attrs | 0.002009 | 1 | 0.58 |
| aggregate | 0.000054 | 2 | 0.02 |
| net_write | 0.000076 | 2 | 0.02 |
| eval_post | 0.000001 | 1 | 0.00 |
| total | 0.346179 | 18678 | 0 |
+--------------+----------+----------+---------+
21 rows in set (0.00 sec)
POST /search
{
"index": "test",
"profile": true,
"query":
{
"match_phrase": { "_all" : "had grown quite" }
}
}
"profile": {
"query": [
{
"status": "unknown",
"duration": 0.000141,
"switches": 8,
"percent": 2.17
},
{
"status": "local_df",
"duration": 0.000870,
"switches": 1,
"percent": 13.40
},
{
"status": "local_search",
"duration": 0.001038,
"switches": 2,
"percent": 15.99
},
{
"status": "setup_iter",
"duration": 0.000154,
"switches": 14,
"percent": 2.37
},
{
"status": "dict_setup",
"duration": 0.000026,
"switches": 3,
"percent": 0.40
},
{
"status": "parse",
"duration": 0.000205,
"switches": 3,
"percent": 3.15
},
{
"status": "transforms",
"duration": 0.000974,
"switches": 4,
"percent": 15.01
},
{
"status": "init",
"duration": 0.002931,
"switches": 20,
"percent": 45.16
},
{
"status": "get_docs",
"duration": 0.000007,
"switches": 7,
"percent": 0.10
},
{
"status": "rank",
"duration": 0.000002,
"switches": 14,
"percent": 0.03
},
{
"status": "finalize",
"duration": 0.000013,
"switches": 7,
"percent": 0.20
},
{
"status": "aggregate",
"duration": 0.000128,
"switches": 1,
"percent": 1.97
},
{
"status": "total",
"duration": 0.006489,
"switches": 84,
"percent": 100.00
}
]
}
The SHOW PLAN SQL statement and the "plan": N JSON interface option display the query execution plan. The plan is generated and stored during the actual execution, so in the case of SQL, profiling must be enabled in the current session before running that statement. This can be done with a SET profiling=1 statement.
Two items are returned in SQL mode:
- transformed_tree, which displays the full-text query decomposition.
- enabled_indexes, which provides information about effective secondary indexes.
To view the query execution plan in a JSON query, add "plan": N to the query. The result will appear as a plan property in the result set. N can be one of the following:
- 1: show only the plan description, in the same format as the SHOW PLAN SQL query. This is the most compact form.
- 2: show the plan as an object tree, without per-node descriptions.
- 3: show the plan as an object tree with a description for each node.

set profiling=1;
select * from hn_small where match('dog|cat') limit 0;
show plan;
*************************** 1. row ***************************
Variable: transformed_tree
Value: OR(
AND(KEYWORD(dog, querypos=1)),
AND(KEYWORD(cat, querypos=2)))
*************************** 2. row ***************************
Variable: enabled_indexes
Value:
2 rows in set (0.00 sec)
POST /search
{
"index": "hn_small",
"query": {"query_string": "dog|cat"},
"_source": { "excludes":["*"] },
"limit": 0,
"plan": 3
}
{
"took": 0,
"timed_out": false,
"hits": {
"total": 4453,
"total_relation": "eq",
"hits": []
},
"plan": {
"query": {
"type": "OR",
"description": "OR( AND(KEYWORD(dog, querypos=1)), AND(KEYWORD(cat, querypos=2)))",
"children": [
{
"type": "AND",
"description": "AND(KEYWORD(dog, querypos=1))",
"children": [
{
"type": "KEYWORD",
"word": "dog",
"querypos": 1
}
]
},
{
"type": "AND",
"description": "AND(KEYWORD(cat, querypos=2))",
"children": [
{
"type": "KEYWORD",
"word": "cat",
"querypos": 2
}
]
}
]
}
}
}
In some cases, the evaluated query tree can be quite different from the original one due to expansions and other transformations.
SET profiling=1;
SELECT id FROM forum WHERE MATCH('@title way* @content hey') LIMIT 1;
SHOW PLAN;
Query OK, 0 rows affected (0.00 sec)
+--------+
| id |
+--------+
| 711651 |
+--------+
1 row in set (0.04 sec)
+------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Variable | Value |
+------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| transformed_tree | AND(
OR(
OR(
AND(fields=(title), KEYWORD(wayne, querypos=1, expanded)),
OR(
AND(fields=(title), KEYWORD(ways, querypos=1, expanded)),
AND(fields=(title), KEYWORD(wayyy, querypos=1, expanded)))),
AND(fields=(title), KEYWORD(way, querypos=1, expanded)),
OR(fields=(title), KEYWORD(way*, querypos=1, expanded))),
AND(fields=(content), KEYWORD(hey, querypos=2))) |
+------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
POST /search
{
"index": "forum",
"query": {"query_string": "@title way* @content hey"},
"_source": { "excludes":["*"] },
"limit": 1,
"plan": 3
}
{
"took":33,
"timed_out":false,
"hits":
{
"total":105,
"hits":
[
{
"_id":"711651",
"_score":2539,
"_source":{}
}
]
},
"plan":
{
"query":
{
"type":"AND",
"description":"AND( OR( OR( AND(fields=(title), KEYWORD(wayne, querypos=1, expanded)), OR( AND(fields=(title), KEYWORD(ways, querypos=1, expanded)), AND(fields=(title), KEYWORD(wayyy, querypos=1, expanded)))), AND(fields=(title), KEYWORD(way, querypos=1, expanded)), OR(fields=(title), KEYWORD(way*, querypos=1, expanded))), AND(fields=(content), KEYWORD(hey, querypos=2)))",
"children":
[
{
"type":"OR",
"description":"OR( OR( AND(fields=(title), KEYWORD(wayne, querypos=1, expanded)), OR( AND(fields=(title), KEYWORD(ways, querypos=1, expanded)), AND(fields=(title), KEYWORD(wayyy, querypos=1, expanded)))), AND(fields=(title), KEYWORD(way, querypos=1, expanded)), OR(fields=(title), KEYWORD(way*, querypos=1, expanded)))",
"children":
[
{
"type":"OR",
"description":"OR( AND(fields=(title), KEYWORD(wayne, querypos=1, expanded)), OR( AND(fields=(title), KEYWORD(ways, querypos=1, expanded)), AND(fields=(title), KEYWORD(wayyy, querypos=1, expanded))))",
"children":
[
{
"type":"AND",
"description":"AND(fields=(title), KEYWORD(wayne, querypos=1, expanded))",
"fields":["title"],
"max_field_pos":0,
"children":
[
{
"type":"KEYWORD",
"word":"wayne",
"querypos":1,
"expanded":true
}
]
},
{
"type":"OR",
"description":"OR( AND(fields=(title), KEYWORD(ways, querypos=1, expanded)), AND(fields=(title), KEYWORD(wayyy, querypos=1, expanded)))",
"children":
[
{
"type":"AND",
"description":"AND(fields=(title), KEYWORD(ways, querypos=1, expanded))",
"fields":["title"],
"max_field_pos":0,
"children":
[
{
"type":"KEYWORD",
"word":"ways",
"querypos":1,
"expanded":true
}
]
},
{
"type":"AND",
"description":"AND(fields=(title), KEYWORD(wayyy, querypos=1, expanded))",
"fields":["title"],
"max_field_pos":0,
"children":
[
{
"type":"KEYWORD",
"word":"wayyy",
"querypos":1,
"expanded":true
}
]
}
]
}
]
},
{
"type":"AND",
"description":"AND(fields=(title), KEYWORD(way, querypos=1, expanded))",
"fields":["title"],
"max_field_pos":0,
"children":
[
{
"type":"KEYWORD",
"word":"way",
"querypos":1,
"expanded":true
}
]
},
{
"type":"OR",
"description":"OR(fields=(title), KEYWORD(way*, querypos=1, expanded))",
"fields":["title"],
"max_field_pos":0,
"children":
[
{
"type":"KEYWORD",
"word":"way*",
"querypos":1,
"expanded":true
}
]
}
]
},
{
"type":"AND",
"description":"AND(fields=(content), KEYWORD(hey, querypos=2))",
"fields":["content"],
"max_field_pos":0,
"children":
[
{
"type":"KEYWORD",
"word":"hey",
"querypos":2
}
]
}
]
}
}
}
POST /search
{
"index": "forum",
"query": {"query_string": "@title way* @content hey"},
"_source": { "excludes":["*"] },
"limit": 1,
"plan": 2
}
{
"took": 33,
"timed_out": false,
"hits": {
"total": 105,
"hits": [
{
"_id": "711651",
"_score": 2539,
"_source": {}
}
]
},
"plan": {
"query": {
"type": "AND",
"children": [
{
"type": "OR",
"children": [
{
"type": "OR",
"children": [
{
"type": "AND",
"fields": [
"title"
],
"max_field_pos": 0,
"children": [
{
"type": "KEYWORD",
"word": "wayne",
"querypos": 1,
"expanded": true
}
]
},
{
"type": "OR",
"children": [
{
"type": "AND",
"fields": [
"title"
],
"max_field_pos": 0,
"children": [
{
"type": "KEYWORD",
"word": "ways",
"querypos": 1,
"expanded": true
}
]
},
{
"type": "AND",
"fields": [
"title"
],
"max_field_pos": 0,
"children": [
{
"type": "KEYWORD",
"word": "wayyy",
"querypos": 1,
"expanded": true
}
]
}
]
}
]
},
{
"type": "AND",
"fields": [
"title"
],
"max_field_pos": 0,
"children": [
{
"type": "KEYWORD",
"word": "way",
"querypos": 1,
"expanded": true
}
]
},
{
"type": "OR",
"fields": [
"title"
],
"max_field_pos": 0,
"children": [
{
"type": "KEYWORD",
"word": "way*",
"querypos": 1,
"expanded": true
}
]
}
]
},
{
"type": "AND",
"fields": [
"content"
],
"max_field_pos": 0,
"children": [
{
"type": "KEYWORD",
"word": "hey",
"querypos": 2
}
]
}
]
}
}
}
POST /search
{
"index": "forum",
"query": {"query_string": "@title way* @content hey"},
"_source": { "excludes":["*"] },
"limit": 1,
"plan": 1
}
{
"took":33,
"timed_out":false,
"hits":
{
"total":105,
"hits":
[
{
"_id":"711651",
"_score":2539,
"_source":{}
}
]
},
"plan":
{
"query":
{
"description":"AND( OR( OR( AND(fields=(title), KEYWORD(wayne, querypos=1, expanded)), OR( AND(fields=(title), KEYWORD(ways, querypos=1, expanded)), AND(fields=(title), KEYWORD(wayyy, querypos=1, expanded)))), AND(fields=(title), KEYWORD(way, querypos=1, expanded)), OR(fields=(title), KEYWORD(way*, querypos=1, expanded))), AND(fields=(content), KEYWORD(hey, querypos=2)))"
}
}
}
See also EXPLAIN QUERY. It displays the execution tree of a full-text query without actually executing the query. Note that when using SHOW PLAN after a query to a real-time table, the result will be based on a random disk/RAM chunk. Therefore, if you have recently modified the table's tokenization settings, or if the chunks vary significantly in terms of dictionaries, etc., you might not get the result you are expecting. Take this into account and consider using EXPLAIN QUERY as well.
query property contains the transformed full-text query tree. Each node contains:
- type: node type. Can be AND, OR, PHRASE, KEYWORD, etc.
- description: query subtree for this node shown as a string (in SHOW PLAN format).
- children: child nodes, if any.
- max_field_pos: maximum position within a field.
- word: transformed keyword. Keyword nodes only.
- querypos: position of this keyword in a query. Keyword nodes only.
- excluded: keyword excluded from query. Keyword nodes only.
- expanded: keyword added by prefix expansion. Keyword nodes only.
- field_start: keyword must occur at the very start of the field. Keyword nodes only.
- field_end: keyword must occur at the very end of the field. Keyword nodes only.
- boost: keyword IDF will be multiplied by this. Keyword nodes only.
SHOW PLAN OPTION format=dot allows returning the full-text query execution tree in a hierarchical format suitable for visualization by existing tools, such as https://dreampuf.github.io/GraphvizOnline:
MySQL [(none)]> show plan option format=dot\G
*************************** 1. row ***************************
Variable: transformed_tree
Value: digraph "transformed_tree"
{
0 [shape=record,style=filled,bgcolor="lightgrey" label="AND"]
0 -> 1
1 [shape=record,style=filled,bgcolor="lightgrey" label="AND"]
1 -> 2
2 [shape=record label="i | { querypos=1 }"]
0 -> 3
3 [shape=record,style=filled,bgcolor="lightgrey" label="AND"]
3 -> 4
4 [shape=record label="me | { querypos=2 }"]
}
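To render the graph, the Value can be written to a .dot file and fed to Graphviz. Below is a minimal sketch using the third-party pymysql driver; the host, port, empty credentials, and the hn_small table are assumptions carried over from the earlier examples:
import pymysql

conn = pymysql.connect(host='127.0.0.1', port=9306, user='')
with conn.cursor() as cur:
    cur.execute('SET profiling=1')    # profiling must be enabled in this session
    cur.execute("SELECT id FROM hn_small WHERE MATCH('dog|cat') LIMIT 0")
    cur.fetchall()
    cur.execute('SHOW PLAN OPTION format=dot')
    variable, value = cur.fetchone()  # ('transformed_tree', 'digraph ...')
    with open('plan.dot', 'w') as f:
        f.write(value)
conn.close()
The resulting plan.dot can then be rendered with dot -Tpng plan.dot -o plan.png or pasted into the online viewer mentioned above.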

SHOW TABLE STATUS is an SQL statement that displays various per-table statistics.
The syntax is:
SHOW TABLE index_name STATUS
Depending on the table type, the displayed statistics include different sets of rows:
- template: index_type.
- distributed: index_type, query_time_1min, query_time_5min, query_time_15min, query_time_total, exact_query_time_1min, exact_query_time_5min, exact_query_time_15min, exact_query_time_total, found_rows_1min, found_rows_5min, found_rows_15min, found_rows_total.
- percolate: index_type, stored_queries, ram_bytes, disk_bytes, max_stack_need, average_stack_base, desired_thread_stack, tid, tid_saved, query_time_1min, query_time_5min, query_time_15min, query_time_total, exact_query_time_1min, exact_query_time_5min, exact_query_time_15min, exact_query_time_total, found_rows_1min, found_rows_5min, found_rows_15min, found_rows_total.
- plain: index_type, indexed_documents, indexed_bytes, possibly a set of field_tokens_* and total_tokens, ram_bytes, disk_bytes, disk_mapped, disk_mapped_cached, disk_mapped_doclists, disk_mapped_cached_doclists, disk_mapped_hitlists, disk_mapped_cached_hitlists, killed_documents, killed_rate, query_time_1min, query_time_5min, query_time_15min, query_time_total, exact_query_time_1min, exact_query_time_5min, exact_query_time_15min, exact_query_time_total, found_rows_1min, found_rows_5min, found_rows_15min, found_rows_total.
- rt: index_type, indexed_documents, indexed_bytes, possibly a set of field_tokens_* and total_tokens, ram_bytes, disk_bytes, disk_mapped, disk_mapped_cached, disk_mapped_doclists, disk_mapped_cached_doclists, disk_mapped_hitlists, disk_mapped_cached_hitlists, killed_documents, killed_rate, ram_chunk, ram_chunk_segments_count, disk_chunks, mem_limit, mem_limit_rate, ram_bytes_retired, locked, tid, tid_saved, query_time_1min, query_time_5min, query_time_15min, query_time_total, exact_query_time_1min, exact_query_time_5min, exact_query_time_15min, exact_query_time_total, found_rows_1min, found_rows_5min, found_rows_15min, found_rows_total.
Here is the meaning of these values:
- index_type: currently one of disk, rt, percolate, template, and distributed.
- indexed_documents: number of indexed documents.
- indexed_bytes: overall size of indexed text. Note that this value is not strict, since in a full-text index it is impossible to strictly recover the stored text to measure it.
- stored_queries: number of percolate queries stored in the table.
- field_tokens_XXX: optional, total per-field lengths (in tokens) across the entire table (used internally for BM25A and BM25F ranking functions). Only available for tables built with index_field_lengths=1.
- total_tokens: optional, overall sum of all field_tokens_XXX.
- ram_bytes: total RAM occupied by the table.
- disk_bytes: total disk space occupied by the table.
- disk_mapped: total size of file mappings.
- disk_mapped_cached: total size of file mappings actually cached in RAM.
- disk_mapped_doclists and disk_mapped_cached_doclists: portion of total and cached mappings belonging to document lists.
- disk_mapped_hitlists and disk_mapped_cached_hitlists: portion of total and cached mappings belonging to hit lists. Doclists and hitlists values are shown separately since they're typically large (e.g., about 90% of the whole table's size).
- killed_documents and killed_rate: the first indicates the number of deleted documents and the second the rate of deleted/indexed. Technically, deleting a document means suppressing it in search output, but it still physically exists in the table and will only be purged after merging/optimizing the table.
- ram_chunk: size of the RAM chunk of a real-time or percolate table.
- ram_chunk_segments_count: the RAM chunk is internally composed of segments, typically no more than 32. This line shows the current count.
- disk_chunks: number of disk chunks in the real-time table.
- mem_limit: actual value of rt_mem_limit for the table.
- mem_limit_rate: the rate at which the RAM chunk will be flushed as a disk chunk, e.g., if rt_mem_limit is 128M and the rate is 50%, a new disk chunk will be saved when the RAM chunk exceeds 64M.
- ram_bytes_retired: represents the size of garbage in RAM chunks (e.g., deleted or replaced documents not yet permanently removed).
- locked: a value greater than 0 indicates that the table is currently locked by FREEZE. The number represents how many times the table has been frozen. For instance, a table might be frozen by manticore-backup and then frozen again by replication. It should only be completely unfrozen when no other process requires it to be frozen.
- max_stack_need: stack space needed to evaluate the most complex of the stored percolate queries. This is a dynamic value that depends on build details such as compiler, optimization, hardware, etc.
- average_stack_base: stack space usually occupied at the start of evaluating a percolate query.
- desired_thread_stack: sum of the above values, rounded up to a 128-byte boundary. If this value is greater than thread_stack, you may not be able to execute CALL PQ on this table, as some stored queries will fail. The default thread_stack value is 1M (1048576); other values should be configured.
- tid and tid_saved: represent the state of saving the table. tid increases with each change (transaction). tid_saved shows the max tid of the state saved in a RAM chunk in the <table>.ram file. When the numbers differ, some changes exist only in RAM and are also backed by the binlog (if enabled). Performing FLUSH TABLE or scheduling periodic flushing saves these changes. After flushing, the binlog is cleared, and tid_saved represents the new actual state.
- query_time_*, exact_query_time_*: query execution time statistics for the last 1 minute, 5 minutes, 15 minutes, and total since server start; data is encapsulated as a JSON object, including the number of queries and min, max, avg, 95, and 99 percentile values.
- found_rows_*: statistics of rows found by queries; provided for the last 1 minute, 5 minutes, 15 minutes, and total since server start; data is encapsulated as a JSON object, including the number of queries and min, max, avg, 95, and 99 percentile values.

mysql> SHOW TABLE statistic STATUS;
+-----------------------------+--------------------------------------------------------------------------+
| Variable_name | Value |
+-----------------------------+--------------------------------------------------------------------------+
| index_type | rt |
| indexed_documents | 146000 |
| indexed_bytes | 149504000 |
| ram_bytes | 87674788 |
| disk_bytes | 1762811 |
| disk_mapped | 794147 |
| disk_mapped_cached | 802816 |
| disk_mapped_doclists | 0 |
| disk_mapped_cached_doclists | 0 |
| disk_mapped_hitlists | 0 |
| disk_mapped_cached_hitlists | 0 |
| killed_documents | 0 |
| killed_rate | 0.00% |
| ram_chunk | 86865484 |
| ram_chunk_segments_count | 24 |
| disk_chunks | 1 |
| mem_limit | 134217728 |
| mem_limit_rate | 95.00% |
| ram_bytes_retired | 0 |
| locked | 0 |
| tid | 0 |
| tid_saved | 0 |
| query_time_1min | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"} |
| query_time_5min | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"} |
| query_time_15min | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"} |
| query_time_total | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"} |
| found_rows_1min | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"} |
| found_rows_5min | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"} |
| found_rows_15min | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"} |
| found_rows_total | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"} |
+-----------------------------+--------------------------------------------------------------------------+
29 rows in set (0.00 sec)
$index->status();
Array(
[index_type] => rt
[indexed_documents] => 3
[indexed_bytes] => 0
[ram_bytes] => 6678
[disk_bytes] => 611
[ram_chunk] => 990
[ram_chunk_segments_count] => 2
[mem_limit] => 134217728
[ram_bytes_retired] => 0
[locked] => 0
[tid] => 15
[query_time_1min] => {"queries":1, "avg_sec":0.001, "min_sec":0.001, "max_sec":0.001, "pct95_sec":0.001, "pct99_sec":0.001}
[query_time_5min] => {"queries":1, "avg_sec":0.001, "min_sec":0.001, "max_sec":0.001, "pct95_sec":0.001, "pct99_sec":0.001}
[query_time_15min] => {"queries":1, "avg_sec":0.001, "min_sec":0.001, "max_sec":0.001, "pct95_sec":0.001, "pct99_sec":0.001}
[query_time_total] => {"queries":1, "avg_sec":0.001, "min_sec":0.001, "max_sec":0.001, "pct95_sec":0.001, "pct99_sec":0.001}
[found_rows_1min] => {"queries":1, "avg":3, "min":3, "max":3, "pct95":3, "pct99":3}
[found_rows_5min] => {"queries":1, "avg":3, "min":3, "max":3, "pct95":3, "pct99":3}
[found_rows_15min] => {"queries":1, "avg":3, "min":3, "max":3, "pct95":3, "pct99":3}
[found_rows_total] => {"queries":1, "avg":3, "min":3, "max":3, "pct95":3, "pct99":3}
)
utilsApi.sql('SHOW TABLE statistic STATUS')
{u'columns': [{u'Key': {u'type': u'string'}},
{u'Value': {u'type': u'string'}}],
u'data': [
{u'Key': u'index_type', u'Value': u'rt'}
{u'Key': u'indexed_documents', u'Value': u'3'}
{u'Key': u'indexed_bytes', u'Value': u'0'}
{u'Key': u'ram_bytes', u'Value': u'6678'}
{u'Key': u'disk_bytes', u'Value': u'611'}
{u'Key': u'ram_chunk', u'Value': u'990'}
{u'Key': u'ram_chunk_segments_count', u'Value': u'2'}
{u'Key': u'mem_limit', u'Value': u'134217728'}
{u'Key': u'ram_bytes_retired', u'Value': u'0'}
{u'Key': u'locked', u'Value': u'0'}
{u'Key': u'tid', u'Value': u'15'}
{u'Key': u'query_time_1min', u'Value': u'{"queries":1, "avg_sec":0.001, "min_sec":0.001, "max_sec":0.001, "pct95_sec":0.001, "pct99_sec":0.001}'}
{u'Key': u'query_time_5min', u'Value': u'{"queries":1, "avg_sec":0.001, "min_sec":0.001, "max_sec":0.001, "pct95_sec":0.001, "pct99_sec":0.001}'}
{u'Key': u'query_time_15min', u'Value': u'{"queries":1, "avg_sec":0.001, "min_sec":0.001, "max_sec":0.001, "pct95_sec":0.001, "pct99_sec":0.001}'}
{u'Key': u'query_time_total', u'Value': u'{"queries":1, "avg_sec":0.001, "min_sec":0.001, "max_sec":0.001, "pct95_sec":0.001, "pct99_sec":0.001}'}
{u'Key': u'found_rows_1min', u'Value': u'{"queries":1, "avg":3, "min":3, "max":3, "pct95":3, "pct99":3}'}
{u'Key': u'found_rows_5min', u'Value': u'{"queries":1, "avg":3, "min":3, "max":3, "pct95":3, "pct99":3}'}
{u'Key': u'found_rows_15min', u'Value': u'{"queries":1, "avg":3, "min":3, "max":3, "pct95":3, "pct99":3}'}
{u'Key': u'found_rows_total', u'Value': u'{"queries":1, "avg":3, "min":3, "max":3, "pct95":3, "pct99":3}'}],
u'error': u'',
u'total': 0,
u'warning': u''}
res = await utilsApi.sql('SHOW TABLE statistic STATUS');
{"columns": [{"Key": {"type": "string"}},
{"Value": {"type": "string"}}],
"data": [
{"Key": "index_type", "Value": "rt"}
{"Key": "indexed_documents", "Value": "3"}
{"Key": "indexed_bytes", "Value": "0"}
{"Key": "ram_bytes", "Value": "6678"}
{"Key": "disk_bytes", "Value": "611"}
{"Key": "ram_chunk", "Value": "990"}
{"Key": "ram_chunk_segments_count", "Value": "2"}
{"Key": "mem_limit", "Value": "134217728"}
{"Key": "ram_bytes_retired", "Value": "0"}
{"Key": "locked", "Value": "0"}
{"Key": "tid", "Value": "15"}
{"Key": "query_time_1min", "Value": "{"queries":1, "avg_sec":0.001, "min_sec":0.001, "max_sec":0.001, "pct95_sec":0.001, "pct99_sec":0.001}"}
{"Key": "query_time_5min", "Value": "{"queries":1, "avg_sec":0.001, "min_sec":0.001, "max_sec":0.001, "pct95_sec":0.001, "pct99_sec":0.001}"}
{"Key": "query_time_15min", "Value": "{"queries":1, "avg_sec":0.001, "min_sec":0.001, "max_sec":0.001, "pct95_sec":0.001, "pct99_sec":0.001}"}
{"Key": "query_time_total", "Value": "{"queries":1, "avg_sec":0.001, "min_sec":0.001, "max_sec":0.001, "pct95_sec":0.001, "pct99_sec":0.001}"}
{"Key": "found_rows_1min", "Value": "{"queries":1, "avg":3, "min":3, "max":3, "pct95":3, "pct99":3}"}
{"Key": "found_rows_5min", "Value": "{"queries":1, "avg":3, "min":3, "max":3, "pct95":3, "pct99":3}"}
{"Key": "found_rows_15min", "Value": "{"queries":1, "avg":3, "min":3, "max":3, "pct95":3, "pct99":3}"}
{"Key": "found_rows_total", "Value": "{"queries":1, "avg":3, "min":3, "max":3, "pct95":3, "pct99":3}"}],
"error": "",
"total": 0,
"warning": ""}
utilsApi.sql("SHOW TABLE statistic STATUS");
{columns=[{ Key : { type=string }},
{ Value : { type=string }}],
data : [
{ Key=index_type, Value=rt}
{ Key=indexed_documents, Value=3}
{ Key=indexed_bytes, Value=0}
{ Key=ram_bytes, Value=6678}
{ Key=disk_bytes, Value=611}
{ Key=ram_chunk, Value=990}
{ Key=ram_chunk_segments_count, Value=2}
{ Key=mem_limit, Value=134217728}
{ Key=ram_bytes_retired, Value=0}
{ Key=locked, Value=0}
{ Key=tid, Value=15}
{ Key=query_time_1min, Value={queries:1, avg_sec:0.001, min_sec:0.001, max_sec:0.001, pct95_sec:0.001, pct99_sec:0.001}}
{ Key=query_time_5min, Value={queries:1, avg_sec:0.001, min_sec:0.001, max_sec:0.001, pct95_sec:0.001, pct99_sec:0.001}}
{ Key=query_time_15min, Value={queries:1, avg_sec:0.001, min_sec:0.001, max_sec:0.001, pct95_sec:0.001, pct99_sec:0.001}}
{ Key=query_time_total, Value={queries:1, avg_sec:0.001, min_sec:0.001, max_sec:0.001, pct95_sec:0.001, pct99_sec:0.001}}
{ Key=found_rows_1min, Value={queries:1, avg:3, min:3, max:3, pct95:3, pct99:3}}
{ Key=found_rows_5min, Value={queries:1, avg:3, min:3, max:3, pct95:3, pct99:3}}
{ Key=found_rows_15min, Value={queries:1, avg:3, min:3, max:3, pct95:3, pct99:3}}
{ Key=found_rows_total, Value={queries:1, avg:3, min:3, max:3, pct95:3, pct99:3}}],
error= ,
total=0,
warning= }
utilsApi.Sql("SHOW TABLE statistic STATUS");
{columns=[{ Key : { type=string }},
{ Value : { type=string }}],
data : [
{ Key=index_type, Value=rt}
{ Key=indexed_documents, Value=3}
{ Key=indexed_bytes, Value=0}
{ Key=ram_bytes, Value=6678}
{ Key=disk_bytes, Value=611}
{ Key=ram_chunk, Value=990}
{ Key=ram_chunk_segments_count, Value=2}
{ Key=mem_limit, Value=134217728}
{ Key=ram_bytes_retired, Value=0}
{ Key=locked, Value=0}
{ Key=tid, Value=15}
{ Key=query_time_1min, Value={queries:1, avg_sec:0.001, min_sec:0.001, max_sec:0.001, pct95_sec:0.001, pct99_sec:0.001}}
{ Key=query_time_5min, Value={queries:1, avg_sec:0.001, min_sec:0.001, max_sec:0.001, pct95_sec:0.001, pct99_sec:0.001}}
{ Key=query_time_15min, Value={queries:1, avg_sec:0.001, min_sec:0.001, max_sec:0.001, pct95_sec:0.001, pct99_sec:0.001}}
{ Key=query_time_total, Value={queries:1, avg_sec:0.001, min_sec:0.001, max_sec:0.001, pct95_sec:0.001, pct99_sec:0.001}}
{ Key=found_rows_1min, Value={queries:1, avg:3, min:3, max:3, pct95:3, pct99:3}}
{ Key=found_rows_5min, Value={queries:1, avg:3, min:3, max:3, pct95:3, pct99:3}}
{ Key=found_rows_15min, Value={queries:1, avg:3, min:3, max:3, pct95:3, pct99:3}}
{ Key=found_rows_total, Value={queries:1, avg:3, min:3, max:3, pct95:3, pct99:3}}],
error="" ,
total=0,
warning="" }
res = await utilsApi.sql('SHOW TABLE statistic STATUS');
{
"columns":
[{
"Key": {"type": "string"}
},
{
"Value": {"type": "string"}
}],
"data":
[
{"Key": "index_type", "Value": "rt"}
{"Key": "indexed_documents", "Value": "3"}
{"Key": "indexed_bytes", "Value": "0"}
{"Key": "ram_bytes", "Value": "6678"}
{"Key": "disk_bytes", "Value": "611"}
{"Key": "ram_chunk", "Value": "990"}
{"Key": "ram_chunk_segments_count", "Value": "2"}
{"Key": "mem_limit", "Value": "134217728"}
{"Key": "ram_bytes_retired", "Value": "0"}
{"Key": "locked", "Value": "0"}
{"Key": "tid", "Value": "15"}
{"Key": "query_time_1min", "Value": "{"queries":1, "avg_sec":0.001, "min_sec":0.001, "max_sec":0.001, "pct95_sec":0.001, "pct99_sec":0.001}"}
{"Key": "query_time_5min", "Value": "{"queries":1, "avg_sec":0.001, "min_sec":0.001, "max_sec":0.001, "pct95_sec":0.001, "pct99_sec":0.001}"}
{"Key": "query_time_15min", "Value": "{"queries":1, "avg_sec":0.001, "min_sec":0.001, "max_sec":0.001, "pct95_sec":0.001, "pct99_sec":0.001}"}
{"Key": "query_time_total", "Value": "{"queries":1, "avg_sec":0.001, "min_sec":0.001, "max_sec":0.001, "pct95_sec":0.001, "pct99_sec":0.001}"}
{"Key": "found_rows_1min", "Value": "{"queries":1, "avg":3, "min":3, "max":3, "pct95":3, "pct99":3}"}
{"Key": "found_rows_5min", "Value": "{"queries":1, "avg":3, "min":3, "max":3, "pct95":3, "pct99":3}"}
{"Key": "found_rows_15min", "Value": "{"queries":1, "avg":3, "min":3, "max":3, "pct95":3, "pct99":3}"}
{"Key": "found_rows_total", "Value": "{"queries":1, "avg":3, "min":3, "max":3, "pct95":3, "pct99":3}"}
],
"error": "",
"total": 0,
"warning": ""
}
apiClient.UtilsAPI.Sql(context.Background()).Body("SHOW TABLE statistic STATUS").Execute()
{
"columns":
[{
"Key": {"type": "string"}
},
{
"Value": {"type": "string"}
}],
"data":
[
{"Key": "index_type", "Value": "rt"}
{"Key": "indexed_documents", "Value": "3"}
{"Key": "indexed_bytes", "Value": "0"}
{"Key": "ram_bytes", "Value": "6678"}
{"Key": "disk_bytes", "Value": "611"}
{"Key": "ram_chunk", "Value": "990"}
{"Key": "ram_chunk_segments_count", "Value": "2"}
{"Key": "mem_limit", "Value": "134217728"}
{"Key": "ram_bytes_retired", "Value": "0"}
{"Key": "locked", "Value": "0"}
{"Key": "tid", "Value": "15"}
{"Key": "query_time_1min", "Value": "{"queries":1, "avg_sec":0.001, "min_sec":0.001, "max_sec":0.001, "pct95_sec":0.001, "pct99_sec":0.001}"}
{"Key": "query_time_5min", "Value": "{"queries":1, "avg_sec":0.001, "min_sec":0.001, "max_sec":0.001, "pct95_sec":0.001, "pct99_sec":0.001}"}
{"Key": "query_time_15min", "Value": "{"queries":1, "avg_sec":0.001, "min_sec":0.001, "max_sec":0.001, "pct95_sec":0.001, "pct99_sec":0.001}"}
{"Key": "query_time_total", "Value": "{"queries":1, "avg_sec":0.001, "min_sec":0.001, "max_sec":0.001, "pct95_sec":0.001, "pct99_sec":0.001}"}
{"Key": "found_rows_1min", "Value": "{"queries":1, "avg":3, "min":3, "max":3, "pct95":3, "pct99":3}"}
{"Key": "found_rows_5min", "Value": "{"queries":1, "avg":3, "min":3, "max":3, "pct95":3, "pct99":3}"}
{"Key": "found_rows_15min", "Value": "{"queries":1, "avg":3, "min":3, "max":3, "pct95":3, "pct99":3}"}
{"Key": "found_rows_total", "Value": "{"queries":1, "avg":3, "min":3, "max":3, "pct95":3, "pct99":3}"}
],
"error": "",
"total": 0,
"warning": ""
}
SHOW TABLE SETTINGS is an SQL statement that displays per-table settings in a format compatible with the config file.
The syntax is:
SHOW TABLE index_name[.N | CHUNK N] SETTINGS
The output resembles the --dumpconfig option of the indextool utility. The report provides a breakdown of all table settings, including tokenizer and dictionary options.
SHOW TABLE forum SETTINGS;
+---------------+-----------------------------------------------------------------------------------------------------------+
| Variable_name | Value |
+---------------+-----------------------------------------------------------------------------------------------------------+
| settings | min_prefix_len = 3
charset_table = 0..9, A..Z->a..z, _, -, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F |
+---------------+-----------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
You can also specify a particular chunk number to view the settings of a specific chunk in an RT table. The numbering is 0-based.
SHOW TABLE forum CHUNK 0 SETTINGS;
+---------------+-----------------------------------------------------------------------------------------------------------+
| Variable_name | Value |
+---------------+-----------------------------------------------------------------------------------------------------------+
| settings | min_prefix_len = 3
charset_table = 0..9, A..Z->a..z, _, -, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F |
+---------------+-----------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
The following settings belong in the searchd section of the Manticore Search configuration file and control the server's behavior. Below is a summary of each setting:
This setting sets instance-wide defaults for access_plain_attrs. It is optional, with a default value of mmap_preread.
The access_plain_attrs directive allows you to define the default value of access_plain_attrs for all tables managed by this searchd instance. Per-table directives have higher priority and will override this instance-wide default, providing more fine-grained control.
This setting sets instance-wide defaults for access_blob_attrs. It is optional, with a default value of mmap_preread.
The access_blob_attrs directive allows you to define the default value of access_blob_attrs for all tables managed by this searchd instance. Per-table directives have higher priority and will override this instance-wide default, providing more fine-grained control.
This setting sets instance-wide defaults for access_doclists. It is optional, with a default value of file.
The access_doclists directive allows you to define the default value of access_doclists for all tables managed by this searchd instance. Per-table directives have higher priority and will override this instance-wide default, providing more fine-grained control.
This setting sets instance-wide defaults for access_hitlists. It is optional, with a default value of file.
The access_hitlists directive allows you to define the default value of access_hitlists for all tables managed by this searchd instance. Per-table directives have higher priority and will override this instance-wide default, providing more fine-grained control.
This setting sets instance-wide defaults for access_dict. It is optional, with a default value of mmap_preread.
The access_dict directive allows you to define the default value of access_dict for all tables managed by this searchd instance. Per-table directives have higher priority and will override this instance-wide default, providing more fine-grained control.
This setting sets instance-wide defaults for the agent_connect_timeout parameter.
This setting sets instance-wide defaults for the agent_query_timeout parameter. It can be overridden on a per-query basis using the OPTION agent_query_timeout=XXX clause.
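For example, a minimal illustrative snippet (the values are arbitrary; both directives accept milliseconds or the usual time suffixes):
agent_connect_timeout = 1000 # give agents up to 1 second to accept a connection
agent_query_timeout = 5000 # give agents up to 5 seconds to answer a query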
This setting is an integer that specifies how many times Manticore will attempt to connect and query remote agents through a distributed table before reporting a fatal query error. The default value is 0 (i.e., no retries). You can also set this value on a per-query basis using the OPTION retry_count=XXX clause. If a per-query option is provided, it will override the value specified in the configuration.
Note that if you use agent mirrors in the definition of your distributed table, the server will select a different mirror for each connection attempt according to the chosen ha_strategy. In this case, the agent_retry_count will be aggregated for all mirrors in a set.
For example, if you have 10 mirrors and set agent_retry_count=5, the server will retry up to 50 times, assuming an average of 5 tries for each of the 10 mirrors (with the ha_strategy = roundrobin option, this will be the case).
However, the value provided as the retry_count option for the agent serves as an absolute limit. In other words, the [retry_count=2] option in the agent definition always means a maximum of 2 attempts, regardless of whether you have specified 1 or 10 mirrors for the agent.
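For example (the value is illustrative):
agent_retry_count = 3 # attempt remote agents up to 3 times before reporting a query error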
This setting is an integer in milliseconds (or special_suffixes) that specifies the delay before Manticore retries querying a remote agent in case of failure. This value is only relevant when a non-zero agent_retry_count or non-zero per-query retry_count is specified. The default value is 500. You can also set this value on a per-query basis using the OPTION retry_delay=XXX clause. If a per-query option is provided, it will override the value specified in the configuration.
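For example (illustrative value; the directive is assumed to be agent_retry_delay):
agent_retry_delay = 250 # wait 250 ms between retry attempts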
When using Update to modify document attributes in real-time, the changes are first written to an in-memory copy of the attributes. These updates occur in a memory-mapped file, meaning the OS decides when to write the changes to disk. Upon normal shutdown of searchd (triggered by a SIGTERM signal), all changes are forced to be written to disk.
You can also instruct searchd to periodically write these changes back to disk to prevent data loss. The interval between these flushes is determined by attr_flush_period, specified in seconds (or special_suffixes).
By default, the value is 0, which disables periodic flushing. However, flushing will still occur during a normal shutdown.
attr_flush_period = 900 # persist updates to disk every 15 minutes
This setting controls the automatic OPTIMIZE process for table compaction.
By default, table compaction occurs automatically. You can modify this behavior with the auto_optimize setting:
0 to disable automatic table compaction (you can still run OPTIMIZE TABLE manually)
1 to enable it explicitly
N to enable it and start OPTIMIZE once the number of disk chunks exceeds # of CPU cores * 2 * N
Note that toggling auto_optimize on or off doesn't prevent you from running OPTIMIZE TABLE manually.
auto_optimize = 0 # disable automatic OPTIMIZE
auto_optimize = 2 # OPTIMIZE starts at 16 chunks (on 4 cpu cores server)
Manticore supports the automatic creation of tables that don't yet exist but are specified in INSERT statements. This feature is enabled by default. To disable it, set auto_schema = 0 explicitly in your configuration. To re-enable it, set auto_schema = 1 or remove the auto_schema setting from the configuration.
Keep in mind that the /bulk HTTP endpoint does not support automatic table creation.
auto_schema = 0 # disable automatic table creation
auto_schema = 1 # enable automatic table creation
This setting controls the binary log transaction flush/sync mode. It is optional, with a default value of 2 (flush every transaction, sync every second).
The directive determines how frequently the binary log will be flushed to the OS and synced to disk. Three modes are supported:
0 - flush and sync every second
1 - flush and sync every transaction
2 - flush every transaction, sync every second
For those familiar with MySQL and InnoDB, this directive is similar to innodb_flush_log_at_trx_commit. In most cases, the default hybrid mode 2 provides a nice balance of speed and safety, with full RT table data protection against server crashes and some protection against hardware ones.
binlog_flush = 1 # ultimate safety, low speed
This setting controls the maximum binary log file size. It is optional, with a default value of 268435456, or 256 MB.
A new binlog file will be forcibly opened once the current binlog file reaches this size limit. This results in a finer granularity of logs and can lead to more efficient binlog disk usage under certain borderline workloads. A value of 0 indicates that the binlog file should not be reopened based on size.
binlog_max_log_size = 16M
This setting determines the path for binary log (also known as transaction log) files. It is optional, with a default value of the build-time configured data directory (e.g., /var/lib/manticore/data/binlog.* in Linux).
Binary logs are used for crash recovery of RT table data and for attribute updates of plain disk indices that would otherwise only be stored in RAM until flush. When logging is enabled, every transaction COMMIT-ted into an RT table is written into a log file. Logs are then automatically replayed on startup after an unclean shutdown, recovering the logged changes.
The binlog_path directive specifies the location of binary log files. It should only contain the path; searchd will create and unlink multiple binlog.* files in the directory as necessary (including binlog data, metadata, and lock files, etc).
An empty value disables binary logging, which improves performance but puts the RT table data at risk.
binlog_path = # disable logging
binlog_path = /var/lib/manticore/data # /var/lib/manticore/data/binlog.001 etc will be created
This setting determines the path to the Manticore Buddy binary. It is optional, with a default value being the build-time configured path, which varies across different operating systems. Typically, you don't need to modify this setting. However, it may be useful if you wish to run Manticore Buddy in debug mode, make changes to Manticore Buddy, or implement a new plugin. In the latter case, you can git clone Buddy from https://github.com/manticoresoftware/manticoresearch-buddy, add a new plugin to the directory ./plugins/, and run composer install --prefer-source for easier development after you change the directory to the Buddy source.
To ensure you can run composer, your machine must have PHP 8.2 or higher installed with the following extensions:
--enable-dom
--with-libxml
--enable-tokenizer
--enable-xml
--enable-xmlwriter
--enable-xmlreader
--enable-simplexml
--enable-phar
--enable-bcmath
--with-gmp
--enable-debug
--with-mysqli
--enable-mysqlnd
You can also opt for the special manticore-executor-dev version for Linux amd64 available in the releases, for example: https://github.com/manticoresoftware/executor/releases/tag/v1.0.13
If you go this route, remember to link the dev version of the manticore executor to /usr/bin/php.
buddy_path = manticore-executor -n /usr/share/manticore/modules/manticore-buddy/src/main.php --debug # use the default Manticore Buddy in Linux, but run it in debug mode
buddy_path = manticore-executor -n /opt/homebrew/share/manticore/modules/manticore-buddy/bin/manticore-buddy/src/main.php --debug # use the default Manticore Buddy in MacOS arm64, but run it in debug mode
buddy_path = manticore-executor -n /Users/username/manticoresearch-buddy/src/main.php --debug # use Manticore Buddy from a non-default location
This setting determines the maximum time to wait between requests (in seconds or special_suffixes) when using persistent connections. It is optional, with a default value of five minutes.
client_timeout = 1h
Server libc locale. Optional, default is C.
Specifies the libc locale, affecting the libc-based collations. Refer to collations section for the details.
collation_libc_locale = fr_FR
Default server collation. Optional, default is libc_ci.
Specifies the default collation used for incoming requests. The collation can be overridden on a per-query basis. Refer to collations section for the list of available collations and other details.
collation_server = utf8_ci
When specified, this setting enables the real-time mode, which is an imperative way of managing data schema. The value should be a path to the directory where you want to store all your tables, binary logs, and everything else needed for the proper functioning of Manticore Search in this mode.
Indexing of plain tables is not allowed when the data_dir is specified. Read more about the difference between the RT mode and the plain mode in this section.
data_dir = /var/lib/manticore
This setting specifies the maximum size of document blocks from document storage that are held in memory. It is optional, with a default value of 16m (16 megabytes).
When stored_fields is used, document blocks are read from disk and uncompressed. Since every block typically holds several documents, it may be reused when processing the next document. For this purpose, the block is held in a server-wide cache. The cache holds uncompressed blocks.
docstore_cache_size = 8m
Default attribute storage engine used when creating tables in RT mode. Can be rowwise (default) or columnar.
engine = columnar
This setting determines the maximum number of expanded keywords for a single wildcard. It is optional, with a default value of 0 (no limit).
When performing substring searches against tables built with dict = keywords enabled, a single wildcard may potentially result in thousands or even millions of matched keywords (think of matching a* against the entire Oxford dictionary). This directive allows you to limit the impact of such expansions. Setting expansion_limit = N restricts expansions to no more than N of the most frequent matching keywords (per each wildcard in the query).
expansion_limit = 16
This setting determines the maximum number of documents in the expanded keyword that allows merging all such keywords together. It is optional, with a default value of 32.
When performing substring searches against tables built with dict = keywords enabled, a single wildcard may potentially result in thousands or even millions of matched keywords. This directive allows you to increase the limit of how many keywords will merge together to speed up matching but uses more memory in the search.
expansion_merge_threshold_docs = 1024
This setting determines the maximum number of hits in the expanded keyword that allows merging all such keywords together. It is optional, with a default value of 256.
When performing substring searches against tables built with dict = keywords enabled, a single wildcard may potentially result in thousands or even millions of matched keywords. This directive allows you to increase the limit of how many keywords will merge together to speed up matching but uses more memory in the search.
expansion_merge_threshold_hits = 512
This setting specifies whether timed grouping in API and SQL will be calculated in the local timezone or in UTC. It is optional, with a default value of 0 (meaning 'local timezone').
By default, all 'group by time' expressions (like group by day, week, month, and year in API, also group by day, month, year, yearmonth, yearmonthday in SQL) are done using local time. For example, if you have documents with attributes timed 13:00 utc and 15:00 utc, in the case of grouping, they both will fall into facility groups according to your local timezone setting. If you live in utc, it will be one day, but if you live in utc+10, then these documents will be matched into different group by day facility groups (since 13:00 utc in UTC+10 timezone is 23:00 local time, but 15:00 is 01:00 of the next day). Sometimes such behavior is unacceptable, and it is desirable to make time grouping not dependent on timezone. You can run the server with the TZ environment variable defined globally, but it will affect not only grouping but also timestamping in the logs, which may be undesirable as well. Switching this option 'on' (either in the config or using the SET GLOBAL statement in SQL) causes all time grouping expressions to be calculated in UTC, leaving the rest of the time-dependent functions (i.e., server logging) in the local TZ.
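For example, to make all time grouping UTC-based (illustrative):
grouping_in_utc = 1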
This setting specifies the timezone to be used by date/time-related functions. By default, the local timezone is used, but you can specify a different timezone in IANA format (e.g., Europe/Amsterdam).
Note that this setting has no impact on logging, which always operates in the local timezone.
Also, note that if grouping_in_utc is used, the 'group by time' function will still use UTC, while other date/time-related functions will use the specified timezone. Overall, it is not recommended to mix grouping_in_utc and timezone.
You can configure this option either in the config or by using the SET global statement in SQL.
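For example (the timezone value is illustrative):
timezone = Europe/Amsterdam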
This setting specifies the agent mirror statistics window size, in seconds (or special_suffixes). It is optional, with a default value of 60 seconds.
For a distributed table with agent mirrors in it (see more in agent), the master tracks several different per-mirror counters. These counters are then used for failover and balancing (the master picks the best mirror to use based on the counters). Counters are accumulated in blocks of ha_period_karma seconds.
After beginning a new block, the master may still use the accumulated values from the previous one until the new one is half full. As a result, any previous history stops affecting the mirror choice after 1.5 times ha_period_karma seconds at most.
Even though at most two blocks are used for mirror selection, up to 15 last blocks are stored for instrumentation purposes. These blocks can be inspected using the SHOW AGENT STATUS statement.
ha_period_karma = 2m
This setting configures the interval between agent mirror pings, in milliseconds (or special_suffixes). It is optional, with a default value of 1000 milliseconds.
For a distributed table with agent mirrors in it (see more in agent), the master sends all mirrors a ping command during idle periods. This is to track the current agent status (alive or dead, network roundtrip, etc). The interval between such pings is defined by this directive. To disable pings, set ha_ping_interval to 0.
ha_ping_interval = 3s
The hostname_lookup option defines the strategy for renewing hostnames. By default, the IP addresses of agent host names are cached at server start to avoid excessive access to DNS. However, in some cases, the IP can change dynamically (e.g. cloud hosting) and it may be desirable to not cache the IPs. Setting this option to request disables the caching and queries the DNS for each query. The IP addresses can also be manually renewed using the FLUSH HOSTNAMES command.
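For example, to disable caching and resolve host names on every query (illustrative):
hostname_lookup = request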
The jobs_queue_size setting defines how many "jobs" can be in the queue at the same time. It is unlimited by default.
In most cases, a "job" means one query to a single local table (plain table or a disk chunk of a real-time table). For example, if you have a distributed table consisting of 2 local tables or a real-time table with 2 disk chunks, a search query to either of them will mostly put 2 jobs in the queue. Then, the thread pool (whose size is defined by threads will process them. However, in some cases, if the query is too complex, more jobs can be created. Changing this setting is recommended when max_connections and threads are not enough to find a balance between the desired performance.
The listen_backlog setting determines the length of the TCP listen backlog for incoming connections. This is particularly relevant for Windows builds that process requests one by one. When the connection queue reaches its limit, new incoming connections will be refused.
For non-Windows builds, the default value should work fine, and there is usually no need to adjust this setting.
listen_backlog = 20
This setting lets you specify an IP address and port, or Unix-domain socket path, that Manticore will accept connections on.
The general syntax for listen is:
listen = ( address ":" port | port | path | address ":" port start - port end ) [ ":" protocol [ "_vip" ] [ "_readonly" ] ]
You can specify:
If you specify a port number but not an address, searchd will listen on all network interfaces. Unix path is identified by a leading slash. Port range can be set only for the replication protocol.
You can also specify a protocol handler (listener) to be used for connections on this socket. The listeners are:
Manticore Buddy. Ensure you have a listener of this kind (or an http listener, as mentioned below) to avoid limitations in Manticore functionality.
mysql - MySQL protocol for connections from MySQL clients. Note:
If SSL is enabled, you can make an encrypted connection.
replication - replication protocol used for nodes communication. More details can be found in the replication section. You can specify multiple replication listeners, but they must all listen on the same IP; only the ports can be different.
http - same as Not specified. Manticore will accept connections at this port from remote agents and clients via HTTP and HTTPS.
https - HTTPS protocol. Manticore will accept only HTTPS connections at this port. More details can be found in section SSL.
sphinx - legacy binary protocol. Used to serve connections from remote SphinxSE clients. Some Sphinx API client implementations (the Java one, for example) require the explicit declaration of the listener.
Adding the suffix _vip to client protocols (that is, all except replication, for instance mysql_vip or http_vip or just _vip) forces creating a dedicated thread for the connection to bypass different limitations. That's useful for node maintenance in case of severe overload, when the server would otherwise either stall or not let you connect via a regular port.
Suffix _readonly sets read-only mode for the listener and limits it to accept only read queries.
listen = localhost
listen = localhost:5000 # listen for remote agents (binary API) and http/https requests on port 5000 at localhost
listen = 192.168.0.1:5000 # listen for remote agents (binary API) and http/https requests on port 5000 at 192.168.0.1
listen = /var/run/manticore/manticore.s # listen for binary API requests on unix socket
listen = /var/run/manticore/manticore.s:mysql # listen for mysql requests on unix socket
listen = 9312 # listen for remote agents (binary API) and http/https requests on port 9312 on any interface
listen = localhost:9306:mysql # listen for mysql requests on port 9306 at localhost
listen = localhost:9307:mysql_readonly # listen for mysql requests on port 9307 at localhost and accept only read queries
listen = 127.0.0.1:9308:http # listen for http requests as well as connections from remote agents (and binary API) on port 9308 at localhost
listen = 192.168.0.1:9320-9328:replication # listen for replication connections on ports 9320-9328 at 192.168.0.1
listen = 127.0.0.1:9443:https # listen for https requests (not http) on port 9443 at 127.0.0.1
listen = 127.0.0.1:9312:sphinx # listen for legacy Sphinx requests (e.g. from SphinxSE) on port 9312 at 127.0.0.1
There can be multiple listen directives. searchd will listen for client connections on all specified ports and sockets. The default config provided in Manticore packages defines listening on ports:
9308 and 9312 for connections from remote agents and non-MySQL based clients
9306 for MySQL connections.
If you don't specify any listen in the configuration at all, Manticore will wait for connections on:
127.0.0.1:9306 for MySQL clients
127.0.0.1:9312 for HTTP/HTTPS and connections from other Manticore nodes and clients based on the Manticore binary API.
By default, Linux won't allow you to let Manticore listen on a port below 1024 (e.g. listen = 127.0.0.1:80:http or listen = 127.0.0.1:443:https) unless you run searchd under root. If you still want to be able to start Manticore, so it listens on ports < 1024 under a non-root user, consider doing one of the following (either of these should work):
Run setcap CAP_NET_BIND_SERVICE=+eip /usr/bin/searchd
Add AmbientCapabilities=CAP_NET_BIND_SERVICE to Manticore's systemd unit and reload the daemon (systemctl daemon-reload).
This setting allows the TCP_FASTOPEN flag for all listeners. By default, it is managed by the system, but it may be explicitly switched off by setting it to '0'.
For general knowledge about the TCP Fast Open extension, please consult with Wikipedia. In short, it allows the elimination of one TCP round-trip when establishing a connection.
In practice, using TFO in many situations may optimize client-agent network efficiency, as if persistent agents are in play, but without holding active connections, and also without limitation for the maximum num of connections.
On modern OS, TFO support is usually switched 'on' at the system level, but this is just a 'capability', not the rule. Linux (as the most progressive) has supported it since 2011, on kernels starting from 3.7 (for the server-side). Windows has supported it from some builds of Windows 10. Other operating systems (FreeBSD, MacOS) are also in the game.
On Linux, the server checks the /proc/sys/net/ipv4/tcp_fastopen variable and behaves according to it. Bit 0 manages the client side, bit 1 rules listeners. By default, the system has this parameter set to 1, i.e., clients enabled, listeners disabled.
The log setting specifies the name of the log file where all searchd run time events will be logged. If not specified, the default name is 'searchd.log'.
Alternatively, you can use 'syslog' as the file name. In this case, the events will be sent to the syslog daemon. To use the syslog option, you need to configure Manticore with the --with-syslog option during building.
log = /var/log/searchd.log
Limits the amount of queries per batch. Optional, default is 32.
Makes searchd perform a sanity check of the amount of queries submitted in a single batch when using multi-queries. Set it to 0 to skip the check.
max_batch_queries = 256
Maximum number of simultaneous client connections. Unlimited by default. That is usually noticeable only when using any kind of persistent connections, like cli mysql sessions or persistent remote connections from remote distributed tables. When the limit is exceeded you can still connect to the server using the VIP connection. VIP connections are not counted towards the limit.
max_connections = 10
Instance-wide limit of threads one operation can use. By default, appropriate operations can occupy all CPU cores, leaving no room for other operations. For example, call pq against a considerably large percolate table can utilize all threads for tens of seconds. Setting max_threads_per_query to, say, half of threads will ensure that you can run a couple of such call pq operations in parallel.
You can also set this setting as a session or a global variable during runtime.
Additionally, you can control the behavior on a per-query basis with the help of the threads OPTION.
max_threads_per_query = 4
Maximum allowed per-query filter count. This setting is only used for internal sanity checks and does not directly affect RAM usage or performance. Optional, the default is 256.
max_filters = 1024
Maximum allowed per-filter values count. This setting is only used for internal sanity checks and does not directly affect RAM usage or performance. Optional, the default is 4096.
max_filter_values = 16384
The maximum number of files that the server is allowed to open is called the "soft limit". Note that serving large fragmented real-time tables may require this limit to be set high, as each disk chunk may occupy a dozen or more files. For example, a real-time table with 1000 chunks may require thousands of files to be opened simultaneously. If you encounter the error 'Too many open files' in the logs, try adjusting this option, as it may help resolve the issue.
There is also a "hard limit" that cannot be exceeded by the option. This limit is defined by the system and can be changed in the file /etc/security/limits.conf on Linux. Other operating systems may have different approaches, so consult your manuals for more information.
max_open_files = 10000
Apart from direct numeric values, you can use the magic word 'max' to set the limit equal to the available current hard limit.
max_open_files = max
Maximum allowed network packet size. This setting limits both query packets from clients and response packets from remote agents in a distributed environment. Only used for internal sanity checks, it does not directly affect RAM usage or performance. Optional, the default is 8M.
max_packet_size = 32M
A server version string to return via the MySQL protocol. Optional, the default is empty (returns the Manticore version).
Several picky MySQL client libraries depend on a particular version number format used by MySQL, and moreover, sometimes choose a different execution path based on the reported version number (rather than the indicated capabilities flags). For instance, Python MySQLdb 1.2.2 throws an exception when the version number is not in X.Y.ZZ format; MySQL .NET connector 6.3.x fails internally on version numbers 1.x along with a certain combination of flags, etc. To work around that, you can use the mysql_version_string directive and have searchd report a different version to clients connecting over the MySQL protocol. (By default, it reports its own version.)
mysql_version_string = 5.0.37
Number of network threads, the default is 1.
This setting is useful for extremely high query rates when just one thread is not enough to manage all the incoming queries.
Controls the busy loop interval of the network thread. The default is -1, and it can be set to -1, 0, or a positive integer.
In cases where the server is configured as a pure master and just routes requests to agents, it is important to handle requests without delays and not allow the network thread to sleep. There is a busy loop for that. After an incoming request, the network thread uses CPU poll for 10 * net_wait_tm milliseconds if net_wait_tm is a positive number or polls only with the CPU if net_wait_tm is 0. Also, the busy loop can be disabled with net_wait_tm = -1 - in this case, the poller sets the timeout to the actual agent's timeouts on the system polling call.
WARNING: A CPU busy loop actually loads the CPU core, so setting this value to any non-default value will cause noticeable CPU usage even with an idle server.
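For example, to disable the busy loop entirely (illustrative):
net_wait_tm = -1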
Defines how many clients are accepted on each iteration of the network loop. Default is 0 (unlimited), which should be fine for most users. This is a fine-tuning option to control the throughput of the network loop in high load scenarios.
Defines how many requests are processed on each iteration of the network loop. The default is 0 (unlimited), which should be fine for most users. This is a fine-tuning option to control the throughput of the network loop in high load scenarios.
Network client request read/write timeout, in seconds (or special_suffixes). Optional, the default is 5 seconds. searchd will forcibly close a client connection which fails to send a query or read a result within this timeout.
Note also the reset_network_timeout_on_packet parameter. This parameter alters the behavior of network_timeout from applying to the entire query or result to individual packets instead. Typically, a query/result fits within one or two packets. However, in cases where a large amount of data is required, this parameter can be invaluable in maintaining active operations.
network_timeout = 10s
This setting allows you to specify the network address of the node. By default, it is set to the replication listen address. This is correct in most cases; however, there are situations where you have to specify it manually:
node_address = 10.101.0.10
This setting determines whether to allow queries with only the negation full-text operator. Optional, the default is 0 (fail queries with only the NOT operator).
not_terms_only_allowed = 1
Sets the default table compaction threshold. Read more in the Number of optimized disk chunks section. This setting can be overridden with the per-query option cutoff. It can also be changed dynamically via SET GLOBAL.
optimize_cutoff = 4
This setting determines the maximum number of simultaneous persistent connections to remote persistent agents. Each time an agent defined under agent_persistent is connected, we try to reuse an existing connection (if any), or connect and save the connection for future use. However, in some cases, it makes sense to limit the number of such persistent connections. This directive defines the limit. It affects the number of connections to each agent's host across all distributed tables.
It is reasonable to set the value equal to or less than the max_connections option in the agent's config.
persistent_connections_limit = 29 # assume that each host of agents has max_connections = 30 (or 29).
pid_file is a mandatory configuration option in Manticore Search that specifies the path of the file where the process ID of the searchd server is stored.
The searchd process ID file is re-created and locked on startup, and contains the head server process ID while the server is running. It is unlinked on server shutdown.
The purpose of this file is to enable Manticore to perform various internal tasks, such as checking whether there is already a running instance of searchd, stopping searchd, and notifying it that it should rotate the tables. The file can also be used for external automation scripts.
pid_file = /var/run/manticore/searchd.pid
Costs for the query time prediction model, in nanoseconds. Optional, the default is doc=64, hit=48, skip=2048, match=64.
predicted_time_costs = doc=128, hit=96, skip=4096, match=128
Terminating queries before completion based on their execution time (with the max query time setting, i.e., SELECT ... OPTION max_query_time in SQL or SetMaxQueryTime() in the API) is a nice safety net, but it comes with an inherent drawback: indeterministic (unstable) results. That is, if you repeat the very same (complex) search query with a time limit several times, the time limit will be hit at different stages, and you will get different result sets.
There is a new option, SELECT … OPTION max_predicted_time, that lets you limit the query time and get stable, repeatable results. Instead of regularly checking the actual current time while evaluating the query, which is indeterministic, it predicts the current running time using a simple linear model instead:
predicted_time =
doc_cost * processed_documents +
hit_cost * processed_hits +
skip_cost * skiplist_jumps +
match_cost * found_matches
The query is then terminated early when the predicted_time reaches a given limit.
Of course, this is not a hard limit on the actual time spent (it is, however, a hard limit on the amount of processing work done), and a simple linear model is in no way an ideally precise one. So the wall clock time may be either below or over the target limit. However, the error margins are quite acceptable: for instance, in our experiments with a 100 msec target limit, the majority of the test queries fell into a 95 to 105 msec range, and all the queries were in an 80 to 120 msec range. Also, as a nice side effect, using the modeled query time instead of measuring the actual run time results in somewhat fewer gettimeofday() calls, too.
No two server makes and models are identical, so the predicted_time_costs directive lets you configure the costs for the model above. For convenience, they are integers, counted in nanoseconds. (The limit in max_predicted_time is counted in milliseconds, and having to specify cost values as 0.000128 ms instead of 128 ns is somewhat more error-prone.) It is not necessary to specify all four costs at once, as the missed ones will take the default values. However, we strongly suggest specifying all of them for readability.
The preopen_tables configuration directive specifies whether to forcibly preopen all tables on startup. The default value is 1, which means that all tables will be preopened regardless of the per-table preopen setting. If set to 0, the per-table settings can take effect, and they will default to 0.
Pre-opening tables can prevent races between search queries and rotations that can cause queries to fail occasionally. However, it also uses more file handles. In most scenarios, it is recommended to preopen tables.
Here's an example configuration:
preopen_tables = 1
The pseudo_sharding configuration option enables parallelization of search queries to local plain and real-time tables, regardless of whether they are queried directly or through a distributed table. This feature automatically parallelizes queries, using up to the number of threads specified in searchd.threads.
Note that if your worker threads are already busy, because you have:
then enabling pseudo_sharding may not provide any benefits and may even result in a slight decrease in throughput. If you prioritize higher throughput over lower latency, it's recommended to disable this option.
Enabled by default.
pseudo_sharding = 0
The replication_connect_timeout directive defines the timeout for connecting to a remote node. By default, the value is assumed to be in milliseconds, but it can have another suffix. The default value is 1000 (1 second).
When connecting to a remote node, Manticore will wait for this amount of time at most to complete the connection successfully. If the timeout is reached but the connection has not been established, and retries are enabled, a retry will be initiated.
The replication_query_timeout sets the amount of time that searchd will wait for a remote node to complete a query. The default value is 3000 milliseconds (3 seconds), but can be suffixed to indicate a different unit of time.
After establishing a connection, Manticore will wait for a maximum of replication_query_timeout for the remote node to complete. Note that this timeout is separate from the replication_connect_timeout, and the total possible delay caused by a remote node will be the sum of both values.
This setting is an integer that specifies how many times Manticore will attempt to connect and query a remote node during replication before reporting a fatal query error. The default value is 3.
This setting is an integer in milliseconds (or special_suffixes) that specifies the delay before Manticore retries querying a remote node in case of failure during replication. This value is only relevant when a non-zero value is specified. The default value is 500.
This configuration sets the maximum amount of RAM allocated for cached result sets, in bytes. The default value is 16777216, which is equivalent to 16 megabytes. If the value is set to 0, the query cache is disabled. For more information, please refer to the query cache section.
qcache_max_bytes = 16777216
Integer, in milliseconds. The minimum wall time threshold for a query result to be cached. Defaults to 3000, or 3 seconds. 0 means cache everything. Refer to query cache for details. This value also may be expressed with time special_suffixes, but use it with care and don't confuse yourself with the name of the value itself, containing '_msec'.
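An illustrative example (assuming the directive name qcache_thresh_msec):
qcache_thresh_msec = 3000 # only cache result sets of queries that ran for 3+ seconds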
Integer, in seconds. The expiration period for a cached result set. Defaults to 60, or 1 minute. The minimum possible value is 1 second. Refer to query cache for details. This value also may be expressed with time special_suffixes, but use it with care and don't confuse yourself with the name of the value itself, containing '_sec'.
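An illustrative example (assuming the directive name qcache_ttl_sec):
qcache_ttl_sec = 300 # keep cached result sets for 5 minutes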
Query log format. Optional, allowed values are plain and sphinxql, default is sphinxql.
The sphinxql mode logs valid SQL statements. The plain mode logs queries in a plain text format (mostly suitable for purely full-text use cases). This directive allows you to switch between the two formats on search server startup. The log format can also be altered on the fly, using SET GLOBAL query_log_format=sphinxql syntax. Refer to Query logging for more details.
query_log_format = sphinxql
Limit (in milliseconds) that prevents the query from being written to the query log. Optional, default is 0 (all queries are written to the query log). This directive specifies that only queries with execution times that exceed the specified limit will be logged (this value also may be expressed with time special_suffixes, but use it with care and don't confuse yourself with the name of the value itself, containing _msec).
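An illustrative example (assuming the directive name query_log_min_msec):
query_log_min_msec = 1000 # log only queries that take longer than 1 second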
Query log file name. Optional, default is empty (do not log queries). All search queries (such as SELECT ... but not INSERT/REPLACE/UPDATE queries) will be logged in this file. The format is described in Query logging. In case of 'plain' format, you can use 'syslog' as the path to the log file. In this case, all search queries will be sent to the syslog daemon with LOG_INFO priority, prefixed with '[query]' instead of a timestamp. To use the syslog option, Manticore must be configured with --with-syslog on building.
query_log = /var/log/query.log
The query_log_mode directive allows you to set a different permission for the searchd and query log files. By default, these log files are created with 600 permission, meaning that only the user under which the server runs and root users can read the log files.
This directive can be handy if you want to allow other users to read the log files, for example, monitoring solutions running on non-root users.
query_log_mode = 666
The read_buffer_docs directive controls the per-keyword read buffer size for document lists. For every keyword occurrence in every search query, there are two associated read buffers: one for the document list and one for the hit list. This setting lets you control the document list buffer size.
A larger buffer size might increase per-query RAM use, but it can decrease I/O time. It makes sense to set larger values for slow storage, while for storage capable of high IOPS it is worth experimenting with values in the lower range.
The default value is 256K, and the minimal value is 8K. You may also set read_buffer_docs on a per-table basis, which will override anything set on the server's config level.
read_buffer_docs = 128K
The read_buffer_hits directive specifies the per-keyword read buffer size for hit lists in search queries. By default, the size is 256K and the minimum value is 8K. For every keyword occurrence in a search query, there are two associated read buffers, one for the document list and one for the hit list. Increasing the buffer size can increase per-query RAM use but decrease I/O time. For slow storage, larger buffer sizes make sense, while for storage capable of high IOPS it is worth experimenting with values in the lower range.
This setting can also be specified on a per-table basis using the read_buffer_hits table option, which will override the server-level setting.
read_buffer_hits = 128K
Unhinted read size. Optional, default is 32K, minimal value is 1K.
When querying, some reads know in advance exactly how much data is there to be read, but some currently do not. Most prominently, hit list size is not currently known in advance. This setting lets you control how much data to read in such cases. It impacts hit list I/O time, reducing it for lists larger than unhinted read size, but raising it for smaller lists. It does not affect RAM usage because the read buffer will already be allocated. So it should not be greater than read_buffer.
read_unhinted = 32K
Refines the behavior of networking timeouts (such as network_timeout, read_timeout, and agent_query_timeout).
When set to 0, timeouts limit the maximum time for sending the entire request/query.
When set to 1 (default), timeouts limit the maximum time between network activities.
With replication, a node may need to send a large file (for example, 100GB) to another node. Assume the network can transfer data at 1GB/s, with a series of packets of 4-5MB each. To transfer the entire file, you would need 100 seconds. A default timeout of 5 seconds would only allow the transfer of 5GB before the connection is dropped. Increasing the timeout could be a workaround, but it is not scalable (for instance, the next file might be 150GB, leading to failure again). However, with the default reset_network_timeout_on_packet set to 1, the timeout is applied not to the entire transfer but to individual packets. As long as the transfer is in progress (and data is actually being received over the network during the timeout period), it is kept alive. If the transfer gets stuck, such that a timeout occurs between packets, it will be dropped.
Note that if you set up a distributed table, each node — both master and agents — should be tuned. On the master side, agent_query_timeout is affected; on agents, network_timeout is relevant.
reset_network_timeout_on_packet = 0
RT tables RAM chunk flush check period, in seconds (or special_suffixes). Optional, default is 10 hours.
Actively updated RT tables that fully fit in RAM chunks can still result in ever-growing binlogs, impacting disk use and crash recovery time. With this directive, the search server performs periodic flush checks, and eligible RAM chunks can be saved, enabling consequential binlog cleanup. See Binary logging for more details.
rt_flush_period = 3600 # 1 hour
A maximum number of I/O operations (per second) that the RT chunks merge thread is allowed to start. Optional, default is 0 (no limit).
This directive lets you throttle down the I/O impact arising from the OPTIMIZE statements. It is guaranteed that all RT optimization activities will not generate more disk IOPS (I/Os per second) than the configured limit. Limiting rt_merge_iops can reduce search performance degradation caused by merging.
rt_merge_iops = 40
A maximum size of an I/O operation that the RT chunks merge thread is allowed to start. Optional, default is 0 (no limit).
This directive lets you throttle down the I/O impact arising from the OPTIMIZE statements. I/Os larger than this limit will be broken down into two or more I/Os, which will then be accounted for as separate I/Os with regards to the rt_merge_iops limit. Thus, it is guaranteed that all optimization activities will not generate more than (rt_merge_iops * rt_merge_maxiosize) bytes of disk I/O per second.
rt_merge_maxiosize = 1M
Prevents searchd stalls while rotating tables with huge amounts of data to precache. Optional, default is 1 (enable seamless rotation). On Windows systems, seamless rotation is disabled by default.
Tables may contain some data that needs to be precached in RAM. At the moment, .spa, .spb, .spi, and .spm files are fully precached (they contain attribute data, blob attribute data, keyword table, and killed row map, respectively.) Without seamless rotate, rotating a table tries to use as little RAM as possible and works as follows:
searchd waits for all currently running queries to finish;
searchd resumes serving queries from the new table.
However, if there's a lot of attribute or dictionary data, then the preloading step could take a noticeable amount of time - up to several minutes in the case of preloading 1-5+ GB files.
With seamless rotate enabled, rotation works as follows:
Seamless rotate comes at the cost of higher peak memory usage during the rotation (because both old and new copies of .spa/.spb/.spi/.spm data need to be in RAM while preloading the new copy). Average usage remains the same.
seamless_rotate = 1
This option enables/disables the use of secondary indexes for search queries. It is optional, and the default is 1 (enabled). Note that you don't need to enable it for indexing as it is always enabled as long as the Manticore Columnar Library is installed. The latter is also required for using the indexes when searching. There are three modes available:
0: Disable the use of secondary indexes on search. They can be enabled for individual queries using analyzer hints
1: Enable the use of secondary indexes on search. They can be disabled for individual queries using analyzer hints
force: Same as enable, but any errors during the loading of secondary indexes will be reported, and the whole index will not be loaded into the daemon.
Note that secondary indexes are not effective for full-text queries.
secondary_indexes = 1
Integer number that serves as a server identifier used as a seed to generate a unique short UUID for nodes that are part of a replication cluster. The server_id must be unique across the nodes of a cluster and in the range from 0 to 127. If server_id is not set, the MAC address or a random number will be used as a seed for the short UUID.
server_id = 1
searchd --stopwait waiting time, in seconds (or special_suffixes). Optional, default is 60 seconds.
When you run searchd --stopwait, your server needs to perform some activities before stopping, such as finishing queries, flushing RT RAM chunks, flushing attributes, and updating the binlog. These tasks require some time. searchd --stopwait will wait up to shutdown_timeout seconds for the server to finish its jobs. The suitable time depends on your table size and load.
shutdown_timeout = 3m # wait for up to 3 minutes
SHA1 hash of the password required to invoke the 'shutdown' command from a VIP Manticore SQL connection. Without it, the debug 'shutdown' subcommand will never cause the server to stop. Note that such simple hashing should not be considered strong protection, as we don't use a salted hash or any kind of modern hash function. It is intended as a fool-proof measure for housekeeping daemons in a local network.
A prefix to prepend to the local file names when generating snippets. Optional, default is the current working folder.
This prefix can be used in distributed snippets generation along with load_files or load_files_scattered options.
Note that this is a prefix and not a path! This means that if a prefix is set to "server1" and the request refers to "file23", searchd will attempt to open "server1file23" (all of that without quotes). So, if you need it to be a path, you have to include the trailing slash.
After constructing the final file path, the server unwinds all relative dirs and compares the final result with the value of snippet_file_prefix. If the result does not begin with the prefix, such a file will be rejected with an error message.
For example, if you set it to /mnt/data and someone calls snippet generation with the file ../etc/passwd as the source, they will get the error message:
File '/mnt/data/../etc/passwd' escapes '/mnt/data/' scope
instead of the content of the file.
Also, if the parameter is not set and you read /etc/passwd, the server will actually read /daemon/working/folder/etc/passwd, since the default for the parameter is the server's working folder.
Note also that this is a local option; it does not affect the agents in any way. So you can safely set a prefix on a master server. The requests routed to the agents will not be affected by the master's setting. They will, however, be affected by the agent's own settings.
This might be useful, for instance, when the document storage locations (whether local storage or NAS mountpoints) are inconsistent across the servers.
snippets_file_prefix = /mnt/common/server1/
WARNING: If you still want to access files from the FS root, you have to explicitly set
snippets_file_prefix to empty value (by snippets_file_prefix= line), or to root (by snippets_file_prefix=/).
Path to a file where the current SQL state will be serialized.
On server startup, this file gets replayed. On eligible state changes (e.g., SET GLOBAL), this file gets rewritten automatically. This can prevent a hard-to-diagnose problem: If you load UDF functions but Manticore crashes, when it gets (automatically) restarted, your UDF and global variables will no longer be available. Using persistent state helps ensure a graceful recovery with no such surprises.
sphinxql_state cannot be used to execute arbitrary commands, such as CREATE TABLE.
sphinxql_state = uservars.sql
Maximum time to wait between requests (in seconds, or special_suffixes) when using the SQL interface. Optional, default is 15 minutes.
sphinxql_timeout = 15m
Path to the SSL Certificate Authority (CA) certificate file (also known as root certificate). Optional, default is empty. When not empty, the certificate in ssl_cert should be signed by this root certificate.
The server uses the CA file to verify the signature on the certificate. The file must be in PEM format.
ssl_ca = keys/ca-cert.pem
Path to the server's SSL certificate. Optional, default is empty.
The server uses this certificate as a self-signed public key to encrypt HTTP traffic over SSL. The file must be in PEM format.
ssl_cert = keys/server-cert.pem
Path to the SSL certificate key. Optional, default is empty.
The server uses this private key to encrypt HTTP traffic over SSL. The file must be in PEM format.
ssl_key = keys/server-key.pem
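If you don't have a certificate yet, a self-signed pair can be generated with standard openssl tooling; a minimal sketch (the file names and CN are just placeholders):
openssl req -x509 -newkey rsa:4096 -days 365 -nodes -subj "/CN=localhost" -keyout keys/server-key.pem -out keys/server-cert.pem
With a self-signed certificate, ssl_ca can simply be left empty.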
Max common subtree document cache size, per-query. Optional, default is 0 (disabled).
This setting limits the RAM usage of a common subtree optimizer (see multi-queries). At most, this much RAM will be spent to cache document entries for each query. Setting the limit to 0 disables the optimizer.
subtree_docs_cache = 8M
Max common subtree hit cache size, per-query. Optional, default is 0 (disabled).
This setting limits the RAM usage of a common subtree optimizer (see multi-queries). At most, this much RAM will be spent to cache keyword occurrences (hits) for each query. Setting the limit to 0 disables the optimizer.
subtree_hits_cache = 16M
Number of working threads (i.e., the size of the thread pool) for the Manticore daemon. Manticore creates this number of OS threads on start, and they perform all jobs inside the daemon, such as executing queries, creating snippets, etc. Some operations may be split into sub-tasks and executed in parallel; for example, a search in a real-time table consisting of multiple disk chunks can be processed by several threads at once.
By default, it's set to the number of CPU cores on the server. Manticore creates the threads on start and keeps them until it's stopped. Each sub-task can use one of the threads when it needs it. When the sub-task finishes, it releases the thread so another sub-task can use it.
In the case of intensive I/O type of load, it might make sense to set the value higher than the number of CPU cores.
threads = 10
Maximum stack size for a job (coroutine, one search query may cause multiple jobs/coroutines). Optional, default is 128K.
Each job has its own 128K stack. When a query is run, it is checked for how much stack it requires. If the default 128K is enough, it is simply processed. If it needs more, another job with an increased stack is scheduled, which continues the processing. The maximum size of such an enlarged stack is limited by this setting.
Setting the value reasonably high helps with processing very deep queries without causing overall RAM consumption to grow too much. For example, setting it to 1G does not mean that every new job will take 1G of RAM: if a job turns out to require, say, a 100M stack, only 100M is allocated for it, while other jobs keep running with their default 128K stacks. In the same way, even more complex queries needing 500M can be run. Only if a job is determined internally to require more than 1G of stack will it fail, with an error reporting that thread_stack is too low.
However, in practice, even a query that needs a 16M stack is often too complex to parse and consumes too much time and too many resources to be processed. So the daemon will process it, but limiting such queries via the thread_stack setting looks quite reasonable.
thread_stack = 8M
Determines whether to unlink .old table copies on successful rotation. Optional, default is 1 (do unlink).
unlink_old = 0
Threaded server watchdog. Optional, default is 1 (watchdog enabled).
When a Manticore query crashes, it can take down the entire server. With the watchdog feature enabled, searchd also maintains a separate lightweight process that monitors the main server process and automatically restarts it in case of abnormal termination. The watchdog is enabled by default.
watchdog = 0 # disable watchdog
The lemmatizer_base is an optional configuration directive that specifies the base path for lemmatizer dictionaries. The default path is /usr/share/manticore.
The lemmatizer implementation in Manticore Search (see Morphology to learn what lemmatizers are) is dictionary-driven and requires specific dictionary files for different languages. These files can be downloaded from the Manticore website (https://manticoresearch.com/install/#other-downloads).
Example:
lemmatizer_base = /usr/share/manticore/
The progressive_merge is a configuration directive that, when enabled, merges real-time table disk chunks from smaller to larger ones. This approach speeds up the merging process and reduces read/write amplification. By default, this setting is enabled. If disabled, the chunks are merged in the order they were created.
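For example, to disable progressive merging (in which case, as noted above, chunks are merged in the order they were created):
progressive_merge = 0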
The json_autoconv_keynames is an optional configuration directive that determines if and how to auto-convert key names within JSON attributes. The known value is 'lowercase'. By default, this setting is unspecified (meaning no conversion occurs).
When set to lowercase, key names within JSON attributes will be automatically converted to lowercase during indexing. This conversion applies to JSON attributes from all data sources, including SQL and XMLpipe2.
Example:
json_autoconv_keynames = lowercase
The json_autoconv_numbers is an optional configuration directive that determines whether to automatically detect and convert JSON strings that represent numbers into numeric attributes. The default value is 0 (do not convert strings into numbers).
When this option is set to 1, values such as "1234" will be indexed as numbers instead of strings. If the option is set to 0, such values will be indexed as strings. This conversion applies to JSON attributes from all data sources, including SQL and XMLpipe2.
Example:
json_autoconv_numbers = 1
on_json_attr_error is an optional configuration directive that specifies the action to take if JSON format errors are found. The default value is ignore_attr (ignore errors). This setting applies only to sql_attr_json attributes.
By default, JSON format errors are ignored (ignore_attr), and the indexer tool will show a warning. Setting this option to fail_index will cause indexing to fail at the first JSON format error.
Example:
on_json_attr_error = ignore_attr
The plugin_dir is an optional configuration directive that specifies the trusted location for dynamic libraries (UDFs). The default path is /usr/local/lib/manticore/.
This directive sets the trusted directory from which the UDF libraries can be loaded.
Example:
plugin_dir = /usr/local/lib/manticore/
Manticore Search supports the use of special suffixes to simplify numeric values with specific meanings. These suffixes are categorized into size suffixes and time suffixes. The common format for suffixes is an integer followed by a literal, such as 10k or 100d. Literals are case-insensitive, so 10W and 10w are considered the same.
Size suffixes: These suffixes can be used in settings that define data size values, such as memory limits and cache sizes. The available size suffixes are:
- k for kilobytes (1k = 1024 bytes)
- m for megabytes (1m = 1024k)
- g for gigabytes (1g = 1024m)
- t for terabytes (1t = 1024g)
Time suffixes: These suffixes can be used in settings that define time interval values, such as delays or timeouts. Unadorned values for these parameters usually have a documented scale, but instead of guessing, you can use an explicit suffix. The available time suffixes are:
- us for microseconds
- ms for milliseconds
- s for seconds
- m for minutes
- h for hours
- d for days
- w for weeks

Manticore configuration supports shebang syntax, allowing the configuration to be written in a programming language and interpreted at loading. This enables dynamic settings, such as generating tables by querying a database table, modifying settings based on external factors, or including external files containing table and source declarations.
The configuration file is parsed by the declared interpreter, and the output is used as the actual configuration. This occurs each time the configuration is read, not only at searchd startup.
Note: This feature is not available on the Windows platform.
In the following example, PHP is used to create multiple tables with different names and to scan a specific folder for files containing extra table declarations:
#!/usr/bin/php
...
<?php for ($i=1; $i<=6; $i++) { ?>
table test_<?=$i?> {
type = rt
path = /var/lib/manticore/data/test_<?=$i?>
rt_field = subject
...
}
<?php } ?>
...
<?php
$confd_folder='/etc/manticore.conf.d/';
$files = scandir($confd_folder);
foreach ($files as $file)
{
    // skip the special '.' and '..' entries
    if ($file == '.' || $file == '..') {
        continue;
    }
    // include every *.conf file found in the folder
    $fp = new SplFileInfo($confd_folder.$file);
    if ($fp->getExtension() == 'conf') {
        include ($confd_folder.$file);
    }
}
?>
Manticore Search's configuration file supports comments, which help provide explanations or notes within the configuration file. The # character is used to start a comment section. You can place the comment character either at the beginning of a line or inline within a line.
When using comments, be cautious when the # character appears inside character tokenization settings (such as charset_table), as everything following it will be treated as a comment and ignored. To prevent this, specify the character by its UTF-8 code, U+23, instead of the literal #.
If you need to use the # character within your configuration file, such as within database credentials in source declarations, you can escape it using a backslash \. This allows you to include the # character in your settings without it being interpreted as the start of a comment.
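For example (the credential value below is made up):
# this whole line is a comment
sql_pass = pa\#ss   # the backslash keeps '#' as part of the value; this trailing part is an inline comment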
Inheritance of table and source declarations enables better organization of tables with similar settings or structures and reduces the configuration size. Both parent and child tables or sources can utilize inheritance.
No specific configurations are needed for a parent table or source.
In the child table or source declaration, specify the table or source name followed by a colon (:) and the parent name:
table parent {
path = /var/lib/manticore/parent
...
}
table child:parent {
path = /var/lib/manticore/child
...
}
The child will inherit the entire configuration of the parent. Any settings declared in the child will overwrite the inherited values. Be aware that for multi-value settings, defining a single value in the child will clear all inherited values. For example, if the parent has several sql_query_pre declarations and the child has a single sql_query_pre declaration, all inherited sql_query_pre declarations are cleared. To override some of the inherited values from the parent, explicitly declare them in the child. This is also applicable if you don't need a value from the parent. For example, if the sql_query_pre value from the parent is not needed, declare the directive with an empty value in the child like sql_query_pre=.
Note that existing values of a multi-value setting will not be copied if the child declares one value for that setting.
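For example, a minimal sketch with made-up source names:
source base_src {
    sql_query_pre = SET NAMES utf8
    sql_query_pre = SET SESSION query_cache_type = OFF
    ...
}
source child_src : base_src {
    # this single line clears BOTH inherited sql_query_pre values,
    # so any of them that are still needed must be re-declared here
    sql_query_pre = SET NAMES utf8
}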
The inheritance behavior applies to fields and attributes, not just table options. For example, if the parent has two integer attributes and the child needs a new integer attribute, the integer attribute declarations from the parent must be copied into the child configuration.
SET [GLOBAL] server_variable_name = value
SET [INDEX index_name] GLOBAL @user_variable_name = (int_val1 [, int_val2, ...])
SET NAMES value [COLLATE value]
SET @@dummy_variable = ignored_value
The SET statement in Manticore Search allows you to modify variable values. Variable names are case-insensitive, and no variable value changes will persist after a server restart.
Manticore Search supports the SET NAMES statement and SET @@variable_name syntax for compatibility with third-party MySQL client libraries, connectors, and frameworks that may require running these statements when connecting. However, these statements do not have any effect on Manticore Search itself.
There are four classes of variables in Manticore Search:
1. Per-session server variables: set var_name = value
2. Global server variables: set global var_name = value
3. Global user variables: set global @var_name = (value)
4. Global user variables on a distributed table: set index dist_index_name global @var_name = (value)

Global user variables are shared between concurrent sessions. The only supported value type is a list of BIGINTs, and these variables can be used with the IN() operator for filtering purposes. The primary use case for this feature is to upload large lists of values to searchd once and reuse them multiple times later, reducing network overhead. Global user variables can be transferred to all agents of a distributed table or set locally in the case of a local table defined in a distributed table. Example:
// in session 1
mysql> SET GLOBAL @myfilter=(2,3,5,7,11,13);
Query OK, 0 rows affected (0.00 sec)
// later in session 2
mysql> SELECT * FROM test1 WHERE group_id IN @myfilter;
+------+--------+----------+------------+-----------------+------+
| id | weight | group_id | date_added | title | tag |
+------+--------+----------+------------+-----------------+------+
| 3 | 1 | 2 | 1299338153 | another doc | 15 |
| 4 | 1 | 2 | 1299338153 | doc number four | 7,40 |
+------+--------+----------+------------+-----------------+------+
2 rows in set (0.02 sec)
Manticore Search supports per-session and global server variables that affect specific server settings in their respective scopes. Below is a list of known per-session and global server variables:
Known per-session server variables:
- AUTOCOMMIT = {0 | 1} determines if data modification statements should be implicitly wrapped by BEGIN and COMMIT.
- COLLATION_CONNECTION = collation_name selects the collation for ORDER BY or GROUP BY on string values in subsequent queries. Refer to Collations for a list of known collation names.
- WAIT_TIMEOUT/net_read_timeout = <value> sets the connection timeout, either per session or globally. The global value can only be set on a VIP connection.
- net_write_timeout = <value> tunes the network timeout for write operations, i.e., sending data. The global value can be changed only with VIP privileges.
- throttling_period = <INT_VALUE> is the interval (in milliseconds) after which the currently running query is rescheduled. A value of 0 disables throttling, meaning the query will occupy CPU cores until it finishes. If concurrent queries come from other connections at the same time, they will be allocated to free cores or will be suspended until a core is released. Providing a negative value (-1) resets throttling to the default compiled-in value (100ms), which means the query will be rescheduled every 100ms, giving concurrent queries a chance to be executed. The global value (set via set global) can only be set on a VIP connection.
- thread_stack = <value> changes the default value on the fly, which limits the stack size provided to one task. Note that here 'thread' refers not to an OS thread, but to a userspace thread, also known as a coroutine. This can be useful if, for example, you load a percolate table with unexpectedly high requirements; in such cases, 'call pq' would fail with a message about insufficient stack size. Generally, you should stop the daemon, increase the value in the config, and then restart. However, you can also try a new value without restarting by setting it with this variable. The global value can also be changed online with set global thread_stack, but this is available only from a VIP connection.
- optimize_by_id = {0 | 1} is an internal flag used in some debug commands.
- threads_ex (diagnostic) forces Manticore to behave as if it were running on a CPU with the provided profile. As a short example, set threads_ex='4/2+6/3' means 'you have 4 free CPU cores, and when scheduling multiple queries they should be batched by 2; you also have 6 free CPU cores for pseudo-sharding, and parts should be batched by 3'. This option is diagnostic: it is very helpful, for example, to see how your query would run on a configuration you don't have locally (say, a 128-core CPU) or, conversely, to quickly limit the daemon to single-threaded behavior in order to locate a bottleneck or investigate a crash.
- PROFILING = {0 | 1} enables query profiling in the current session. Defaults to 0. See also show profile.
- MAX_THREADS_PER_QUERY = <POSITIVE_INT_VALUE> redefines max_threads_per_query at runtime. The per-session variable influences only the queries run in the same session (connection), i.e., up to disconnect. A value of 0 means 'no limit'. If both the per-session and the global variables are set, the per-session one has a higher priority.
- ro = {1 | 0} switches the session to read-only mode or back. In the show variables output, the variable is displayed with the name session_read_only.

Known global server variables are:
- QUERY_LOG_FORMAT = {plain | sphinxql} changes the current query log format.
- LOG_LEVEL = {info | debug | replication | debugv | debugvv} changes the current log verbosity level.
- QCACHE_MAX_BYTES = <value> changes the query_cache RAM usage limit to the given value.
- QCACHE_THRESH_MSEC = <value> changes the query_cache minimum wall time threshold to the given value.
- QCACHE_TTL_SEC = <value> changes the query_cache TTL for a cached result to the given value.
- MAINTENANCE = {0 | 1} when set to 1, puts the server into maintenance mode. Only clients with VIP connections can execute queries in this mode. All new non-VIP incoming connections are refused, while existing connections are left intact.
- GROUPING_IN_UTC = {0 | 1} when set to 1, causes timed grouping functions (day(), month(), year(), yearmonth(), yearmonthday()) to be calculated in UTC. Read the doc for the grouping_in_utc config param for more details.
- TIMEZONE = <value> specifies the timezone used by date/time-related functions. Read the doc for the timezone config param for more details.
- QUERY_LOG_MIN_MSEC = <value> changes the query_log_min_msec searchd setting. In this case, it expects the value exactly in milliseconds and doesn't parse time suffixes, as in config. Warning: this is a very specific and 'hard' variable; filtered-out messages are simply dropped and never written to the log at all. It is usually better to just filter your log with something like 'grep'; that way you at least keep the full original log as a backup.
- LOG_DEBUG_FILTER = <string value> filters out redundant log messages. If the value is set, all logs with a level above INFO (i.e., DEBUG, DEBUGV, etc.) will be compared with the string and output only if they start with the given value.
- MAX_THREADS_PER_QUERY = <POSITIVE_INT_VALUE> redefines max_threads_per_query at runtime. As a global variable, it changes the behavior for all sessions. A value of 0 means 'no limit'. If both the per-session and global variables are set, the per-session one has a higher priority.
- NET_WAIT = {-1 | 0 | POSITIVE_INT_VALUE} changes the net_wait_tm searchd setting.
- IOSTATS = {0 | 1} enables or disables I/O operations (except for attributes) reporting in the query log.
- CPUSTATS = {1|0} turns on/off CPU time tracking.
- COREDUMP = {1|0} turns on/off saving a core file or a minidump of the server on crash. More details here.
- AUTO_OPTIMIZE = {1|0} turns on/off auto_optimize.
- PSEUDO_SHARDING = {1|0} turns on/off search pseudo-sharding.
- SECONDARY_INDEXES = {1|0} turns on/off secondary indexes for search queries.
- ES_COMPAT = {on/off/dashboards} when set to on (default), Elasticsearch-like write requests are supported; off disables the support; dashboards enables the support and also allows requests from Kibana (this functionality is experimental).
- RESET_NETWORK_TIMEOUT_ON_PACKET = {1|0} changes the reset_network_timeout_on_packet param. Only clients with VIP connections can change this variable.
- optimize_cutoff = <value> changes the value of the config's optimize_cutoff setting on the fly.
- accurate_aggregation sets the default value for the accurate_aggregation option of future queries.
- distinct_precision_threshold sets the default value for the distinct_precision_threshold option of future queries.
- expansion_merge_threshold_docs changes the value of the config's expansion_merge_threshold_docs setting on the fly.
- expansion_merge_threshold_hits changes the value of the config's expansion_merge_threshold_hits setting on the fly.

Examples:
mysql> SET autocommit=0;
Query OK, 0 rows affected (0.00 sec)
mysql> SET GLOBAL query_log_format=sphinxql;
Query OK, 0 rows affected (0.00 sec)
mysql> SET GLOBAL @banned=(1,2,3);
Query OK, 0 rows affected (0.01 sec)
mysql> SET INDEX users GLOBAL @banned=(1,2,3);
Query OK, 0 rows affected (0.01 sec)
To make user variables persistent, make sure sphinxql_state is enabled.
Logstash is a log management tool that collects data from a variety of sources, transforms it on the fly, and sends it to your desired destination. It is often used as a data pipeline for Elasticsearch, an open-source analytics and search engine.
Manticore supports the use of Logstash as a processing pipeline, which allows the collected and transformed data to be sent to Manticore just as it would be to Elasticsearch. Currently, versions 7.6 and above are supported.
Let’s examine a simple example of a Logstash config file used for indexing dpkg.log, a standard log file of the Debian package manager. The log itself has a simple structure, as shown below:
2023-05-31 10:42:55 status triggers-awaited ca-certificates-java:all 20190405ubuntu1.1
2023-05-31 10:42:55 trigproc libc-bin:amd64 2.31-0ubuntu9.9 <none>
2023-05-31 10:42:55 status half-configured libc-bin:amd64 2.31-0ubuntu9.9
2023-05-31 10:42:55 status installed libc-bin:amd64 2.31-0ubuntu9.9
2023-05-31 10:42:55 trigproc systemd:amd64 245.4-4ubuntu3.21 <none>
Here is an example Logstash configuration:
input {
file {
path => ["/var/log/dpkg.log"]
start_position => "beginning"
sincedb_path => "/dev/null"
mode => "read"
exit_after_read => "true"
file_completed_action => "log"
file_completed_log_path => "/dev/null"
}
}
output {
elasticsearch {
index => "dpkg_log"
hosts => ["http://localhost:9308"]
ilm_enabled => false
manage_template => false
}
}
Note that, before proceeding further, one crucial caveat needs to be addressed: Manticore does not support the Index Template Management and Index Lifecycle Management features of Elasticsearch. As these features are enabled by default in Logstash, they need to be explicitly disabled in the config (the manage_template and ilm_enabled options above). Additionally, the hosts option in the output config section must correspond to Manticore’s HTTP listen port (default is localhost:9308).
After adjusting the config as described, you can run Logstash, and the data from the dpkg log will be passed to Manticore and properly indexed.
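For example, assuming the configuration above is saved as dpkg_log.conf, Logstash could be started like this (the path to the binary depends on your installation):
bin/logstash -f dpkg_log.conf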
Here is the resulting schema of the created table and an example of the inserted document:
mysql> DESCRIBE dpkg_log;
+------------------+--------+---------------------+
| Field | Type | Properties |
+------------------+--------+---------------------+
| id | bigint | |
| message | text | indexed stored |
| @version | text | indexed stored |
| @timestamp | text | indexed stored |
| path | text | indexed stored |
| host | text | indexed stored |
+------------------+--------+---------------------+
mysql> SELECT * FROM dpkg_log LIMIT 1\G
*************************** 1. row ***************************
id: 7280000849080746110
host: logstash-db848f65f-lnlf9
message: 2023-04-12 02:03:21 status unpacked libc-bin:amd64 2.31-0ubuntu9
path: /var/log/dpkg.log
@timestamp: 2023-06-16T09:23:57.405Z
@version: 1
Filebeat is a lightweight shipper for forwarding and centralizing log data. Once installed as an agent, it monitors the log files or locations you specify, collects log events, and forwards them for indexing, usually to Elasticsearch or Logstash.
Manticore also supports using Filebeat as a processing pipeline. This allows the collected and transformed data to be sent to Manticore just as it would be to Elasticsearch. Currently, versions 7.10 and above are supported.
Below is a Filebeat config to work with our example dpkg log:
filebeat.inputs:
- type: filestream
id: example
paths:
- /var/log/dpkg.log
output.elasticsearch:
hosts: ["http://localhost:9308"]
index: "dpkg_log"
allow_older_versions: true
setup.ilm:
enabled: false
setup.template:
name: "dpkg_log"
pattern: "dpkg_log"
Note that Filebeat versions higher than 8.10 have the output compression feature enabled by default. That is why the compression_level: 0 option must be added to the configuration file to provide compatibility with Manticore:
filebeat.inputs:
- type: filestream
id: example
paths:
- /var/log/dpkg.log
output.elasticsearch:
hosts: ["http://localhost:9308"]
index: "dpkg_log"
allow_older_versions: true
compression_level: 0
setup.ilm:
enabled: false
setup.template:
name: "dpkg_log"
pattern: "dpkg_log"
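For example, assuming the configuration above is saved as the default filebeat.yml, Filebeat could be started like this:
filebeat -e -c filebeat.yml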
Once you run Filebeat with this configuration, log data will be sent to Manticore and properly indexed. Here is the resulting schema of the table created by Manticore and an example of the inserted document:
mysql> DESCRIBE dpkg_log;
+------------------+--------+--------------------+
| Field | Type | Properties |
+------------------+--------+--------------------+
| id | bigint | |
| @timestamp | text | indexed stored |
| message | text | indexed stored |
| log | json | |
| input | json | |
| ecs | json | |
| host | json | |
| agent | json | |
+------------------+--------+--------------------+
mysql> SELECT * FROM dpkg_log LIMIT 1\G
*************************** 1. row ***************************
id: 7280000849080753116
@timestamp: 2023-06-16T09:27:38.792Z
message: 2023-04-12 02:06:08 status half-installed libhogweed5:amd64 3.5.1+really3.5.1-2
input: {"type":"filestream"}
ecs: {"version":"1.6.0"}
host: {"name":"logstash-db848f65f-lnlf9"}
agent: {"ephemeral_id":"587c2ebc-e7e2-4e27-b772-19c611115996","id":"2e3d985b-3610-4b8b-aa3b-2e45804edd2c","name":"logstash-db848f65f-lnlf9","type":"filebeat","version":"7.10.0","hostname":"logstash-db848f65f-lnlf9"}
log: {"offset":80,"file":{"path":"/var/log/dpkg.log"}}
SphinxSE is a MySQL storage engine that can be compiled into MySQL/MariaDB servers using their pluggable architecture.
Despite its name, SphinxSE does not actually store any data itself. Instead, it serves as a built-in client that enables the MySQL server to communicate with searchd, execute search queries, and retrieve search results. All indexing and searching take place outside MySQL.
Common SphinxSE applications include easier porting of MySQL full-text search applications to Manticore, using Manticore from programming languages for which native APIs are not available, and cases where additional MySQL-side processing of the Manticore result set (such as JOINs with the original document tables or extra filtering) is required.
To build SphinxSE, you will need to obtain a copy of the MySQL sources, prepare them, and then recompile the MySQL binary. MySQL sources (mysql-5.x.yy.tar.gz) can be obtained from the http://dev.mysql.com website.
To compile MySQL 5.0.x with SphinxSE:
1. Copy the sphinx.5.0.yy.diff patch file into the MySQL sources directory and run
$ patch -p1 < sphinx.5.0.yy.diff
If there is no .diff file for the exact version you need to build, try applying the .diff with the closest version numbers. It is important that the patch applies with no rejects.
2. In the MySQL sources directory, run
$ sh BUILD/autorun.sh
3. In the MySQL sources directory, create an sql/sphinx directory and copy all files from the mysqlse directory in the Manticore sources there. Example:
$ cp -R /root/builds/sphinx-0.9.7/mysqlse /root/builds/mysql-5.0.24/sql/sphinx
4. Configure MySQL and enable the new engine:
$ ./configure --with-sphinx-storage-engine
5. Build and install MySQL:
$ make
$ make install

To compile MySQL 5.1.x with SphinxSE:
1. In the MySQL sources directory, create a storage/sphinx directory and copy all files from the mysqlse directory in the Manticore sources to this new location. For example:
$ cp -R /root/builds/sphinx-0.9.7/mysqlse /root/builds/mysql-5.1.14/storage/sphinx
2. In the MySQL sources directory, run
$ sh BUILD/autorun.sh
3. Configure MySQL and enable the Manticore engine:
$ ./configure --with-plugins=sphinx
4. Build and install MySQL:
$ make
$ make install
To verify that SphinxSE has been successfully compiled into MySQL, start the newly built server, run the MySQL client, and issue the SHOW ENGINES query. You should see a list of all available engines. Manticore should be present, and the "Support" column should display "YES":
mysql> show engines;
+------------+----------+-------------------------------------------------------------+
| Engine | Support | Comment |
+------------+----------+-------------------------------------------------------------+
| MyISAM | DEFAULT | Default engine as of MySQL 3.23 with great performance |
...
| SPHINX | YES | Manticore storage engine |
...
+------------+----------+-------------------------------------------------------------+
13 rows in set (0.00 sec)
To search using SphinxSE, you'll need to create a special ENGINE=SPHINX "search table" and then use a SELECT statement with the full-text query placed in the WHERE clause for the query column.
Here's an example create statement and search query:
CREATE TABLE t1
(
id INTEGER UNSIGNED NOT NULL,
weight INTEGER NOT NULL,
query VARCHAR(3072) NOT NULL,
group_id INTEGER,
INDEX(query)
) ENGINE=SPHINX CONNECTION="sphinx://localhost:9312/test";
SELECT * FROM t1 WHERE query='test it;mode=any';
In a search table, the first three columns must have the following types: INTEGER UNSIGNED or BIGINT for the 1st column (document ID), INTEGER or BIGINT for the 2nd column (match weight), and VARCHAR or TEXT for the 3rd column (your query). This mapping is fixed; you cannot omit any of these three required columns, move them around, or change their types. Additionally, the query column must be indexed, while all others should remain unindexed. Column names are ignored, so you can use any arbitrary names.
Additional columns must be either INTEGER, TIMESTAMP, BIGINT, VARCHAR, or FLOAT. They will be bound to attributes provided in the Manticore result set by name, so their names must match the attribute names specified in sphinx.conf. If there's no matching attribute name in the Manticore search results, the column will have NULL values.
Special "virtual" attribute names can also be bound to SphinxSE columns. Use _sph_ instead of @ for that purpose. For example, to obtain the values of @groupby, @count, or @distinct virtual attributes, use _sph_groupby, _sph_count, or _sph_distinct column names, respectively.
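For instance, here is a hedged sketch of a search table that also captures the per-group count via the @count virtual attribute (the group_id attribute is assumed to exist in the Manticore table):
CREATE TABLE t2
(
    id INTEGER UNSIGNED NOT NULL,
    weight INTEGER NOT NULL,
    query VARCHAR(3072) NOT NULL,
    group_id INTEGER,      -- bound to the group_id attribute by name
    _sph_count INTEGER,    -- bound to the @count virtual attribute
    INDEX(query)
) ENGINE=SPHINX CONNECTION="sphinx://localhost:9312/test";
SELECT * FROM t2 WHERE query='test;groupby=attr:group_id;';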
The CONNECTION string parameter is used to specify the Manticore host, port, and table. If no connection string is specified in CREATE TABLE, the table name * (i.e., search all tables) and localhost:9312 are assumed. The connection string syntax is as follows:
CONNECTION="sphinx://HOST:PORT/TABLENAME"
You can change the default connection string later:
mysql> ALTER TABLE t1 CONNECTION="sphinx://NEWHOST:NEWPORT/NEWTABLENAME";
You can also override these parameters on a per-query basis.
As shown in the example, both the query text and search options should be placed in the WHERE clause on the search query column (i.e., the 3rd column). Options are separated by semicolons, and option names are separated from their values by an equals sign. Any number of options can be specified. The available options are:
... WHERE query='test;sort=attr_asc:group_id';
... WHERE query='test;sort=extended:@weight desc, group_id asc';
... WHERE query='test;index=test1;';
... WHERE query='test;index=test1,test2,test3;';
... WHERE query='test;weights=1,2,3;';
# only include groups 1, 5 and 19
... WHERE query='test;filter=group_id,1,5,19;';
# exclude groups 3 and 11
... WHERE query='test;!filter=group_id,3,11;';
# include groups from 3 to 7, inclusive
... WHERE query='test;range=group_id,3,7;';
# exclude groups from 5 to 25
... WHERE query='test;!range=group_id,5,25;';
# filter by a float size
... WHERE query='test;floatrange=size,2,3;';
# pick all results within 1000 meter from geoanchor
... WHERE query='test;floatrange=@geodist,0,1000;';
... WHERE query='test;maxmatches=2000;';
... WHERE query='test;cutoff=10000;';
... WHERE query='test;maxquerytime=1000;';
... WHERE query='test;groupby=day:published_ts;';
... WHERE query='test;groupby=attr:group_id;';
... WHERE query='test;groupsort=@count desc;';
... WHERE query='test;groupby=attr:country_id;distinct=site_id';
... WHERE query='test;indexweights=tbl_exact,2,tbl_stemmed,1;';
... WHERE query='test;fieldweights=title,10,abstract,3,content,1;';
... WHERE query='test;comment=marker001;';
... WHERE query='test;select=2*a+3*b as myexpr;';
# override the searchd host name and TCP port
... WHERE query='test;host=sphinx-test.loc;port=7312;';
... WHERE query='test;mode=extended;ranker=bm25;';
... WHERE query='test;mode=extended;ranker=expr:sum(lcs);';
The "export" ranker functions similarly to ranker=expr, but it retains the per-document factor values, while ranker=expr discards them after computing the final WEIGHT() value. Keep in mind that ranker=export is intended for occasional use, such as training a machine learning (ML) function or manually defining your own ranking function, and should not be used in actual production. When utilizing this ranker, you'll likely want to examine the output of the RANKFACTORS() function, which generates a string containing all the field-level factors for each document.
SELECT *, WEIGHT(), RANKFACTORS()
FROM myindex
WHERE MATCH('dog')
OPTION ranker=export('100*bm25');
*************************** 1. row ***************************
id: 555617
published: 1110067331
channel_id: 1059819
title: 7
content: 428
weight(): 69900
rankfactors(): bm25=699, bm25a=0.666478, field_mask=2,
doc_word_count=1, field1=(lcs=1, hit_count=4, word_count=1,
tf_idf=1.038127, min_idf=0.259532, max_idf=0.259532, sum_idf=0.259532,
min_hit_pos=120, min_best_span_pos=120, exact_hit=0,
max_window_hits=1), word1=(tf=4, idf=0.259532)
*************************** 2. row ***************************
id: 555313
published: 1108438365
channel_id: 1058561
title: 8
content: 249
weight(): 68500
rankfactors(): bm25=685, bm25a=0.675213, field_mask=3,
doc_word_count=1, field0=(lcs=1, hit_count=1, word_count=1,
tf_idf=0.259532, min_idf=0.259532, max_idf=0.259532, sum_idf=0.259532,
min_hit_pos=8, min_best_span_pos=8, exact_hit=0, max_window_hits=1),
field1=(lcs=1, hit_count=2, word_count=1, tf_idf=0.519063,
min_idf=0.259532, max_idf=0.259532, sum_idf=0.259532, min_hit_pos=36,
min_best_span_pos=36, exact_hit=0, max_window_hits=1), word1=(tf=3,
idf=0.259532)
... WHERE query='test;geoanchor=latattr,lonattr,0.123,0.456';
One very important note is that it is much more efficient to let Manticore handle sorting, filtering, and slicing the result set, rather than increasing the max matches count and using WHERE, ORDER BY, and LIMIT clauses on the MySQL side. This is due to two reasons. First, Manticore employs a variety of optimizations and performs these tasks better than MySQL. Second, less data would need to be packed by searchd, transferred, and unpacked by SphinxSE.
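For example, rather than fetching many rows and doing ORDER BY and LIMIT on the MySQL side, the sorting and paging can be pushed into the query string itself (offset and limit here are SphinxSE query options; a sketch based on the t1 table above):
... WHERE query='test;sort=attr_desc:group_id;offset=0;limit=20;';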
You can obtain additional information related to the query results using the SHOW ENGINE SPHINX STATUS statement:
mysql> SHOW ENGINE SPHINX STATUS;
+--------+-------+-------------------------------------------------+
| Type | Name | Status |
+--------+-------+-------------------------------------------------+
| SPHINX | stats | total: 25, total found: 25, time: 126, words: 2 |
| SPHINX | words | sphinx:591:1256 soft:11076:15945 |
+--------+-------+-------------------------------------------------+
2 rows in set (0.00 sec)
You can also access this information through status variables. Keep in mind that using this method does not require super-user privileges.
mysql> SHOW STATUS LIKE 'sphinx_%';
+--------------------+----------------------------------+
| Variable_name | Value |
+--------------------+----------------------------------+
| sphinx_total | 25 |
| sphinx_total_found | 25 |
| sphinx_time | 126 |
| sphinx_word_count | 2 |
| sphinx_words | sphinx:591:1256 soft:11076:15945 |
+--------------------+----------------------------------+
5 rows in set (0.00 sec)
SphinxSE search tables can be joined with tables using other engines. Here's an example using the "documents" table from example.sql:
mysql> SELECT content, date_added FROM test.documents docs
-> JOIN t1 ON (docs.id=t1.id)
-> WHERE query="one document;mode=any";
+-------------------------------------+---------------------+
| content | docdate |
+-------------------------------------+---------------------+
| this is my test document number two | 2006-06-17 14:04:28 |
| this is my test document number one | 2006-06-17 14:04:28 |
+-------------------------------------+---------------------+
2 rows in set (0.00 sec)
mysql> SHOW ENGINE SPHINX STATUS;
+--------+-------+---------------------------------------------+
| Type | Name | Status |
+--------+-------+---------------------------------------------+
| SPHINX | stats | total: 2, total found: 2, time: 0, words: 2 |
| SPHINX | words | one:1:2 document:2:2 |
+--------+-------+---------------------------------------------+
2 rows in set (0.00 sec)
SphinxSE also features a UDF function that allows you to create snippets using MySQL. This functionality is similar to HIGHLIGHT(), but can be accessed through MySQL+SphinxSE.
The binary providing the UDF is called sphinx.so and should be automatically built and installed in the appropriate location along with SphinxSE. If it doesn't install automatically for some reason, locate sphinx.so in the build directory and copy it to your MySQL instance's plugins directory. Once done, register the UDF with the following statement:
CREATE FUNCTION sphinx_snippets RETURNS STRING SONAME 'sphinx.so';
The function name must be sphinx_snippets; you cannot use an arbitrary name. The function arguments are as follows:
Prototype: function sphinx_snippets ( document, table, words [, options] );
The document and words arguments can be either strings or table columns. Options must be specified like this: 'value' AS option_name. For a list of supported options, refer to the Highlighting section. The only UDF-specific additional option is called sphinx and allows you to specify the searchd location (host and port).
Usage examples:
SELECT sphinx_snippets('hello world doc', 'main', 'world',
'sphinx://192.168.1.1/' AS sphinx, true AS exact_phrase,
'[**]' AS before_match, '[/**]' AS after_match)
FROM documents;
SELECT title, sphinx_snippets(text, 'index', 'mysql php') AS text
FROM sphinx, documents
WHERE query='mysql php' AND sphinx.id=documents.id;
With the MySQL FEDERATED engine, you can connect to a local or remote Manticore instance from MySQL/MariaDB and perform search queries.
An actual Manticore query can't be used directly with the FEDERATED engine and must be "proxied" (sent as a string in a column) due to the FEDERATED engine's limitations and the fact that Manticore implements custom syntax like the MATCH clause.
To search via FEDERATED, you first need to create a FEDERATED engine table. The Manticore query will be included in a query column in the SELECT performed over the FEDERATED table.
Creating a FEDERATED-compatible MySQL table:
CREATE TABLE t1
(
id INTEGER UNSIGNED NOT NULL,
year INTEGER NOT NULL,
rating FLOAT,
query VARCHAR(1024) NOT NULL,
INDEX(query)
) ENGINE=FEDERATED
DEFAULT CHARSET=utf8
CONNECTION='mysql://FEDERATED@127.0.0.1:9306/DB/movies';
Query OK, 0 rows affected (0.00 sec)
Query FEDERATED compatible table:
SELECT * FROM t1 WHERE query='SELECT * FROM movies WHERE MATCH (\'pie\')';
+----+------+--------+------------------------------------------+
| id | year | rating | query |
+----+------+--------+------------------------------------------+
| 1 | 2019 | 5 | SELECT * FROM movies WHERE MATCH ('pie') |
+----+------+--------+------------------------------------------+
1 row in set (0.04 sec)
The only fixed mapping is the query column. It is mandatory and must be the only column with an index on it.
The Manticore table linked via FEDERATED must be a physical table (plain or real-time).
The FEDERATED table should have columns with the same names as the remote Manticore table attributes since they will be bound to the attributes provided in the Manticore result set by name. However, it might map only some attributes, not all of them.
Manticore server identifies a query from a FEDERATED client by the user name "FEDERATED". The CONNECTION string parameter is used to specify the Manticore host, SQL port, and tables for queries coming through the connection. The connection string syntax is as follows:
CONNECTION="mysql://FEDERATED@HOST:PORT/DB/TABLENAME"
Since Manticore doesn't have the concept of a database, the DB string can be random as it will be ignored by Manticore, but MySQL requires a value in the CONNECTION string definition. As seen in the example, the full SELECT SQL query should be placed in a WHERE clause against the query column.
Only the SELECT statement is supported, not INSERT, REPLACE, UPDATE, or DELETE.
One very important note is that it is much more efficient to allow Manticore to perform sorting, filtering, and slicing the result set than to increase the max matches count and use WHERE, ORDER BY, and LIMIT clauses on the MySQL side. This is for two reasons. First, Manticore implements a number of optimizations and performs better than MySQL for these tasks. Second, less data needs to be packed by searchd, transferred, and unpacked between Manticore and MySQL.
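For example, instead of selecting everything through the FEDERATED table and sorting and limiting it in MySQL, the ORDER BY and LIMIT can be pushed into the proxied Manticore query (a sketch based on the movies example above):
SELECT * FROM t1 WHERE query='SELECT * FROM movies WHERE MATCH (\'pie\') ORDER BY rating DESC LIMIT 5';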
JOINs can be performed between a FEDERATED table and other MySQL tables. This can be used to retrieve information that is not stored in a Manticore table.
SELECT t1.id, t1.year, comments.comment FROM t1 JOIN comments ON t1.id=comments.post_id WHERE query='SELECT * FROM movies WHERE MATCH (\'pie\')';
+----+------+--------------+
| id | year | comment |
+----+------+--------------+
| 1 | 2019 | was not good |
+----+------+--------------+
1 row in set (0.00 sec)
Manticore can be extended with user-defined functions, or UDFs for short, like this:
SELECT id, attr1, myudf (attr2, attr3+attr4) ...
You can dynamically load and unload UDFs into searchd without having to restart the server, and use them in expressions when searching, ranking, etc. A quick summary of the UDF features is as follows:
UDFs can take integer (both 32-bit and 64-bit), float, string, MVA, or PACKEDFACTORS() arguments, can return integer, bigint, float, or string values, and can check the argument count, types, and names during the query setup phase and raise errors. We do not yet support aggregation functions. In other words, your UDFs will be called for just a single document at a time and are expected to return some value for that document. Writing a function that can compute an aggregate value like AVG() over the entire group of documents that share the same GROUP BY key is not yet possible. However, you can use UDFs within the built-in aggregate functions: that is, even though MYCUSTOMAVG() is not supported yet, AVG(MYCUSTOMFUNC()) should work just fine!
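For instance, a sketch (the UDF name, attribute name, and table name are hypothetical):
SELECT gid, AVG(MYCUSTOMFUNC(attr1)) AS avg_score FROM myindex GROUP BY gid;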
UDFs offer a wide range of applications, such as adding custom mathematical or string functions, performing lookups in external storage, or implementing custom ranking factors.
Plugins offer additional opportunities to expand search functionality. They can currently be used to compute custom rankings and tokenize documents and queries.
Here's the complete list of plugin types: ranker plugins, index-time token filter plugins, and query-time token filter plugins.
This section covers the general process of writing and managing plugins; specifics related to creating different types of plugins are discussed in their respective subsections.
So, how do you write and use a plugin? Here's a quick four-step guide:
- Create a dynamic library (either .so or .dll), most likely in C or C++.
- Load the plugin into searchd using the CREATE PLUGIN statement.
- Invoke it using the plugin-specific calls, typically via this or that OPTION.
- To unload or reload a plugin, use DROP PLUGIN and RELOAD PLUGINS, respectively.
Note that while UDFs are first-class plugins, they are installed using a separate CREATE FUNCTION statement. This allows for a neat specification of the return type, without sacrificing backward compatibility or changing the syntax.
Dynamic plugins are supported in threads and thread_pool workers. Multiple plugins (and/or UDFs) can be contained in a single library file. You may choose to either group all project-specific plugins in one large library or create a separate library for each UDF and plugin; it's up to you.
As with UDFs, you should include the src/sphinxudf.h header file. At the very least, you'll need the SPH_UDF_VERSION constant to implement an appropriate version function. Depending on the specific plugin type, you may or may not need to link your plugin with src/sphinxudf.c. However, all functions implemented in sphinxudf.c are related to unpacking the PACKEDFACTORS() blob, and no plugin types have access to that data. So currently, linking with just the header should suffice. (In fact, if you copy over the UDF version number, you won't even need the header file for some plugin types.)
Formally, plugins are simply sets of C functions that adhere to a specific naming pattern. You're typically required to define one key function for the primary task, but you can also define additional functions. For instance, to implement a ranker called "myrank", you must define a myrank_finalize() function that returns the rank value. However, you can also define myrank_init(), myrank_update(), and myrank_deinit() functions. Specific sets of well-known suffixes and call arguments differ based on the plugin type, but _init() and _deinit() are generic, and every plugin has them. Hint: for a quick reference on known suffixes and their argument types, refer to sphinxplugin.h, where the call prototypes are defined at the beginning of the file.
Even though the public interface is defined in pure C, our plugins essentially follow an object-oriented model. Indeed, every _init() function receives a void ** userdata out-parameter, and the pointer value stored at (*userdata) is then passed as the first argument to all other plugin functions. So you can think of a plugin as a class that gets instantiated every time an object of that class is needed to handle a request: the userdata pointer serves as the this pointer; the functions act as methods, and the _init() and _deinit() functions work as constructor and destructor, respectively.
This minor OOP-in-C complication arises because plugins run in a multi-threaded environment, and some need to maintain state. You can't store that state in a global variable in your plugin, so we pass around a userdata parameter, which naturally leads to the OOP model. If your plugin is simple and stateless, the interface allows you to omit _init(), _deinit(), and any other functions.
To summarize, here's the simplest complete ranker plugin in just three lines of C code:
// gcc -fPIC -shared -o myrank.so myrank.c
#include "sphinxudf.h"
int myrank_ver() { return SPH_UDF_VERSION; }
int myrank_finalize(void *u, int w) { return 123; }
Here's how to use the simple ranker plugin:
mysql> CREATE PLUGIN myrank TYPE 'ranker' SONAME 'myrank.so';
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT id, weight() FROM test1 WHERE MATCH('test') OPTION ranker=myrank('');
+------+----------+
| id | weight() |
+------+----------+
| 1 | 123 |
| 2 | 123 |
+------+----------+
2 rows in set (0.01 sec)
SHOW PLUGINS
Displays all the loaded plugins (except for Buddy plugins, see below) and UDFs. The "Type" column should be one of udf, ranker, index_token_filter, or query_token_filter. The "Users" column shows the number of threads that are currently using that plugin in a query. The "Extra" column is intended for various additional plugin-type-specific information; currently, it shows the return type for UDFs and is empty for all other plugin types.
SHOW PLUGINS;
+------+----------+----------------+-------+-------+
| Type | Name | Library | Users | Extra |
+------+----------+----------------+-------+-------+
| udf | sequence | udfexample.dll | 0 | INT |
+------+----------+----------------+-------+-------+
1 row in set (0.00 sec)
SHOW BUDDY PLUGINS
This will display all available plugins, including core and local ones.
To remove a plugin, make sure to use the name listed in the Package column.
SHOW BUDDY PLUGINS;
+------------------------------------------------+-----------------+---------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------+
| Package | Plugin | Version | Type | Info |
+------------------------------------------------+-----------------+---------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------+
| manticoresoftware/buddy-plugin-empty-string | empty-string | 2.1.5 | core | Handles empty queries, which can occur when trimming comments or dealing with specific SQL protocol instructions in comments that are not supported |
| manticoresoftware/buddy-plugin-backup | backup | 2.1.5 | core | BACKUP sql statement |
| manticoresoftware/buddy-plugin-emulate-elastic | emulate-elastic | 2.1.5 | core | Emulates some Elastic queries and generates responses as if they were made by ES |
| manticoresoftware/buddy-plugin-insert | insert | 2.1.5 | core | Auto schema support. When an insert operation is performed and the table does not exist, it creates it with data types auto-detection |
| manticoresoftware/buddy-plugin-alias | alias | 2.1.5 | core | |
| manticoresoftware/buddy-plugin-select | select | 2.1.5 | core | Various SELECTs handlers needed for mysqldump and other software support, mostly aiming to work similarly to MySQL |
| manticoresoftware/buddy-plugin-show | show | 2.1.5 | core | Various "show" queries handlers, for example, `show queries`, `show fields`, `show full tables`, etc |
| manticoresoftware/buddy-plugin-cli-table | cli-table | 2.1.5 | core | /cli endpoint based on /cli_json - outputs query result as a table |
| manticoresoftware/buddy-plugin-plugin | plugin | 2.1.5 | core | Core logic for plugin support and helpers. Also handles `create buddy plugin`, `delete buddy plugin`, and `show buddy plugins` |
| manticoresoftware/buddy-plugin-test | test | 2.1.5 | core | Test plugin, used exclusively for tests |
| manticoresoftware/buddy-plugin-insert-mva | insert-mva | 2.1.5 | core | Manages the restoration of MVA fields with mysqldump |
| manticoresoftware/buddy-plugin-modify-table | modify-table | 2.1.5 | core | Assists in standardizing options in create and alter table statements to show option=1 for integers. Also manages the logic for creating sharded tables. |
| manticoresoftware/buddy-plugin-knn | knn | 2.1.5 | core | Enables KNN by document id |
| manticoresoftware/buddy-plugin-replace | replace | 2.1.5 | core | Enables partial replaces |
+------------------------------------------------+-----------------+---------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------+
UDFs are stored in external dynamic libraries (.so files on UNIX and .dll on Windows systems). Library files must be placed in a trusted folder specified by the plugin_dir directive for security reasons: it's easier to secure a single folder than to allow anyone to install arbitrary code into searchd. You can dynamically load and unload UDFs into searchd using CREATE FUNCTION and DROP FUNCTION SQL statements, respectively. Additionally, you can seamlessly reload UDFs (and other plugins) with the RELOAD PLUGINS statement. Manticore keeps track of currently loaded functions; every time you create or drop a UDF, searchd updates its state in the sphinxql_state file as a plain SQL script.
UDFs are local. To use them on a cluster, you must place the same library on all nodes and run CREATE statements on each node as well. This process may change in future versions.
Once you successfully load a UDF, you can use it in your SELECT or other statements just like any built-in function:
SELECT id, MYCUSTOMFUNC (groupid, authorname), ... FROM myindex
Multiple UDFs (and other plugins) can reside in a single library. The library will only be loaded once and is automatically unloaded once all the UDFs and plugins within it are dropped.
In theory, you can write a UDF in any language, as long as its compiler can import standard C headers and emit standard dynamic libraries with properly exported functions. However, writing in C++ or plain C is the path of least resistance. We provide an example UDF library written in plain C that implements several functions (demonstrating various techniques) alongside our source code, found at src/udfexample.c. This example includes the src/sphinxudf.h header file, which contains definitions of several UDF-related structures and types. For most UDFs and plugins, simply using #include "sphinxudf.h" as shown in the example should be sufficient. However, if you're writing a ranking function and need to access ranking signals (factors) data from within the UDF, you'll also need to compile and link with src/sphinxudf.c (available in our source code), as the implementations of functions that let you access signal data from within the UDF reside in that file.
Both the sphinxudf.h header and sphinxudf.c are standalone, so you can copy those files individually; they don't depend on any other parts of Manticore's source code.
Within your UDF, you must implement and export only a couple of functions. First, for UDF interface version control, you must define a function int LIBRARYNAME_ver(), where LIBRARYNAME is the name of your library file, and you must return SPH_UDF_VERSION (a value defined in sphinxudf.h) from it. Here's an example.
#include <sphinxudf.h>
// our library will be called udfexample.so, so it must define
// a version function named udfexample_ver()
int udfexample_ver()
{
return SPH_UDF_VERSION;
}
This precaution protects you from accidentally loading a library with a mismatching UDF interface version into a newer or older searchd. Secondly, you must implement the actual function as well.
sphinx_int64_t testfunc ( SPH_UDF_INIT * init, SPH_UDF_ARGS * args, char * error_flag )
{
return 123;
}
UDF function names in SQL are case-insensitive. However, the respective C function names are not; they need to be all lower-case, or the UDF will not load. More importantly, it is crucial that:
- the calling convention is C (aka __cdecl);
- the argument count and types match what the function is actually called with in the query;
- the return type matches the one you specify in CREATE FUNCTION.
Unfortunately, there is no (easy) way for us to check for these mistakes when loading the function, and they could crash the server and/or result in unexpected results. Last but not least, all the C functions you implement need to be thread-safe.
The first argument, a pointer to SPH_UDF_INIT structure, is essentially a pointer to our function state. It is optional. In the example just above, the function is stateless, as it simply returns 123 every time it gets called. So, we do not have to define an initialization function, and we can simply ignore that argument.
This argument serves one more purpose. Since a single query can be executed on multiple threads (see pseudo-sharding), the daemon tries to determine whether a UDF is stateful or stateless by checking this argument. If the argument is initialized, parallel execution will be disabled. So, if your UDF is stateful but you don't use this argument, it will be called from multiple threads, and your code needs to be aware of that.
The second argument, a pointer to SPH_UDF_ARGS, is the most important one. All the actual call arguments are passed to your UDF via this structure; it contains the call argument count, names, types, etc. So, whether your function gets called like SELECT id, testfunc(1) or like SELECT id, testfunc('abc', 1000*id+gid, WEIGHT()) or any other way, it will receive the very same SPH_UDF_ARGS structure in all of these cases. However, the data passed in the args structure will be different. In the first example, args->arg_count will be set to 1, in the second example it will be set to 3, and the args->arg_types array will contain different type data, and so on.
Finally, the third argument is an error flag. A UDF can raise it to indicate that some kind of internal error occurred, the UDF cannot continue, and the query should terminate early. You should not use this for argument type checks or for any other error reporting that is likely to happen during normal use. This flag is designed to report sudden critical runtime errors, such as running out of memory.
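To tie the three arguments together, here is a hedged sketch of a stateless UDF that sums all of its 32-bit integer arguments and ignores everything else (the SPH_UDF_TYPE_UINT32 constant is assumed to come from sphinxudf.h; the library's _ver() function is omitted):
#include "sphinxudf.h"
// sum all 32-bit unsigned integer arguments; arguments of any other type are skipped
sphinx_int64_t sumints ( SPH_UDF_INIT * init, SPH_UDF_ARGS * args, char * error_flag )
{
    sphinx_int64_t sum = 0;
    int i;
    for ( i = 0; i < args->arg_count; i++ )
        if ( args->arg_types[i] == SPH_UDF_TYPE_UINT32 )
            sum += *(unsigned int *) args->arg_values[i];
    return sum;
}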
If we wanted to, say, allocate temporary storage for our function to use, or check upfront whether the arguments are of the supported types, then we would need to add two more functions, for UDF initialization and deinitialization, respectively.
int testfunc_init ( SPH_UDF_INIT * init, SPH_UDF_ARGS * args,
char * error_message )
{
// allocate and initialize a little bit of temporary storage
init->func_data = malloc ( sizeof(int) );
*(int*)init->func_data = 123;
// return a success code
return 0;
}
void testfunc_deinit ( SPH_UDF_INIT * init )
{
// free up our temporary storage
free ( init->func_data );
}
Note how testfunc_init() also receives the call arguments structure. By the time it is called, it does not receive any actual values, so the args->arg_values will be NULL. But the argument names and types are known and will be passed. You can check them in the initialization function and return an error if they are of an unsupported type.
UDFs can receive arguments of pretty much any valid internal Manticore type. Refer to the sphinx_udf_argtype enumeration in sphinxudf.h for a full list. Most of the types map straightforwardly to the respective C types.
The most notable type is the SPH_UDF_TYPE_FACTORS argument type. You get that type by calling your UDF with a PACKEDFACTORS() argument. Its data is a binary blob in a certain internal format, and to extract individual ranking signals from that blob, you need to use either the sphinx_factors_XXX() or the sphinx_get_YYY_factor() family of functions.
The sphinx_factors_XXX() family consists of 3 functions:
- sphinx_factors_init() initializes the unpacked SPH_UDF_FACTORS structure
- sphinx_factors_unpack() unpacks the binary blob into the SPH_UDF_FACTORS structure
- sphinx_factors_deinit() cleans up and deallocates the SPH_UDF_FACTORS
First, you need to call init() and unpack(); then you can use the SPH_UDF_FACTORS fields; and finally, you need to clean up with deinit().
This approach is simple but may result in a bunch of memory allocations for each processed document, which could be slow.
The other interface, consisting of a bunch of sphinx_get_YYY_factor() functions, is a bit more verbose to use but accesses the blob data directly and guarantees no allocations. For top-notch ranking UDF performance, you'll want to use this approach.
As for the return types, UDFs can currently return a single INTEGER, BIGINT, FLOAT, or STRING value. The C function return type should be sphinx_int64_t, sphinx_int64_t, double, or char*, respectively. In the last case, you must use the args->fn_malloc function to allocate space for the returned string values. Internally in your UDF, you can use whatever you want, so the testfunc_init() example above is correct code even though it uses malloc() directly: you manage that pointer yourself, it gets freed up using a matching free() call, and all is well. However, the returned string values are managed by Manticore, and we have our own allocator, so for the return values specifically, you need to use it too.
Depending on how your UDFs are used in the query, the main function call (testfunc() in our example) might be called in a rather different volume and order. Specifically,
- UDFs referenced in the WHERE, ORDER BY, or GROUP BY clauses must and will be evaluated for every matched document. They will be called in the natural matching order.
- Without subselects, UDFs that can be evaluated at the very last stage over the final result set will be evaluated that way, but before applying the LIMIT clause. They will be called in the result set order.
- With subselects, such UDFs will also be evaluated after applying the inner LIMIT clause.

The calling sequence of the other functions is fixed, though. Namely:

- testfunc_init() is called once when initializing the query. It can return a non-zero code to indicate a failure; in that case, the query will be terminated, and the error message from the error_message buffer will be returned.
- testfunc() is called for every eligible row (see above), whenever Manticore needs to compute the UDF value. It can also indicate an (internal) failure error by writing a non-zero byte value to error_flag. In that case, it is guaranteed that it will not be called for subsequent rows, and a default return value of 0 will be substituted. Manticore might or might not choose to terminate such queries early; neither behavior is currently guaranteed.
- testfunc_deinit() is called once when the query processing (in a given table shard) ends.

CREATE FUNCTION udf_name
RETURNS {INT | INTEGER | BIGINT | FLOAT | STRING}
SONAME 'udf_lib_file'
The CREATE FUNCTION statement installs a user-defined function (UDF) with the specified name and type from the provided library file. The library file must be located in a trusted plugin_dir directory. Upon successful installation, the function becomes available for use in all subsequent queries received by the server. Example:
mysql> CREATE FUNCTION avgmva RETURNS INTEGER SONAME 'udfexample.dll';
Query OK, 0 rows affected (0.03 sec)
mysql> SELECT *, AVGMVA(tag) AS q from test1;
+------+--------+---------+-----------+
| id | weight | tag | q |
+------+--------+---------+-----------+
| 1 | 1 | 1,3,5,7 | 4.000000 |
| 2 | 1 | 2,4,6 | 4.000000 |
| 3 | 1 | 15 | 15.000000 |
| 4 | 1 | 7,40 | 23.500000 |
+------+--------+---------+-----------+
DROP FUNCTION udf_name
The DROP FUNCTION statement uninstalls a user-defined function (UDF) with the specified name. Upon successful removal, the function will no longer be available for use in subsequent queries. However, ongoing concurrent queries will not be affected, and if necessary, the library unloading will be delayed until those queries are completed. Example:
mysql> DROP FUNCTION avgmva;
Query OK, 0 rows affected (0.00 sec)
CREATE PLUGIN plugin_name TYPE 'plugin_type' SONAME 'plugin_library'
Loads the given library (if it is not already loaded) and loads the specified plugin from it. The available plugin types include:
- ranker
- index_token_filter
- query_token_filter

For more information on writing plugins, please refer to the plugins documentation.
mysql> CREATE PLUGIN myranker TYPE 'ranker' SONAME 'myplugins.so';
Query OK, 0 rows affected (0.00 sec)
Buddy plugins can extend Manticore Search's functionality and enable certain queries that are not natively supported. To learn more about creating Buddy plugins, we recommend reading this article.
To create a Buddy plugin, run the following SQL command:
CREATE PLUGIN <username/package name on https://packagist.org/> TYPE 'buddy' VERSION <package version>
You can also use an alias command specifically created for Buddy plugins, which is easier to remember:
CREATE BUDDY PLUGIN <username/package name on https://packagist.org/> VERSION <package version>
For example, the following commands install the show-hostname plugin into the plugin_dir and enable it without the need to restart the server:
CREATE PLUGIN manticoresoftware/buddy-plugin-show-hostname TYPE 'buddy' VERSION 'dev-main';
CREATE BUDDY PLUGIN manticoresoftware/buddy-plugin-show-hostname VERSION 'dev-main';
DROP PLUGIN plugin_name TYPE 'plugin_type'
Marks the designated plugin for unloading. The unloading process is not instantaneous, as concurrent queries may still be utilizing it. Nevertheless, following a DROP, new queries will no longer have access to the plugin. Subsequently, when all ongoing queries involving the plugin have finished, the plugin will be unloaded. If all plugins from the specified library are unloaded, the library will also be automatically unloaded.
mysql> DROP PLUGIN myranker TYPE 'ranker';
Query OK, 0 rows affected (0.00 sec)
DELETE BUDDY PLUGIN <username/package name on https://packagist.org/>
This action instantly and permanently removes the installed plugin from the plugin_dir. Once removed, the plugin's features will no longer be available.
DELETE BUDDY PLUGIN manticoresoftware/buddy-plugin-show-hostname
To simplify the control of Buddy plugins, especially when developing a new one or modifying an existing one, the enable and disable Buddy plugin commands are provided. These commands act temporarily during runtime and will reset to their defaults after restarting the daemon or performing a Buddy reset. To permanently disable a plugin, it must be removed.
You need the fully qualified package name of the plugin to enable or disable it. To find it, you can run the SHOW BUDDY PLUGINS query and look for the fully qualified name in the package field. For example, the SHOW plugin has the fully qualified name manticoresoftware/buddy-plugin-show.
ENABLE BUDDY PLUGIN <username/package name on https://packagist.org/>
This command reactivates a previously disabled Buddy plugin, allowing it to process your requests again.
ENABLE BUDDY PLUGIN manticoresoftware/buddy-plugin-show
DISABLE BUDDY PLUGIN <username/package name on https://packagist.org/>
This command deactivates an active Buddy plugin, preventing it from processing any further requests.
DISABLE BUDDY PLUGIN manticoresoftware/buddy-plugin-show
RELOAD PLUGINS FROM SONAME 'plugin_library'
Reloads all plugins (UDFs, rankers, etc.) from a given library. In a sense, the reload process is transactional, ensuring that:
1. all plugins are successfully updated to their new versions;
2. the update is atomic, meaning all plugins are replaced simultaneously. This atomicity ensures that queries using multiple functions from a reloaded library will never mix old and new versions.
During the RELOAD, the set of plugins is guaranteed to be consistent; they will either be all old or all new.
The reload process is also seamless, as some version of a reloaded plugin will always be available for concurrent queries, without any temporary disruptions. This is an improvement over using a pair of DROP and CREATE statements for reloading. With those, there is a brief window between the DROP and the subsequent CREATE during which queries technically refer to an unknown plugin and will therefore fail.
If there's any failure, RELOAD PLUGINS does nothing, retains the old plugins, and reports an error.
On Windows, overwriting or deleting a DLL library currently in use can be problematic. However, you can still rename it, place a new version under the old name, and then RELOAD will work. After a successful reload, you'll also be able to delete the renamed old library.
mysql> RELOAD PLUGINS FROM SONAME 'udfexample.dll';
Query OK, 0 rows affected (0.00 sec)
Ranker plugins let you implement a custom ranker that receives all the occurrences of the keywords matched in the document, and computes a WEIGHT() value. They can be called as follows:
SELECT id, attr1 FROM test WHERE match('hello') OPTION ranker=myranker('option1=1');
The call workflow proceeds as follows:
- XXX_init() is invoked once per query per table, at the very beginning. Several query-wide options are passed to it via a SPH_RANKER_INIT structure, including the user options strings (for instance, "option1=1" in the example above).
- XXX_update() is called multiple times for each matched document, with every matched keyword occurrence provided as its parameter, a SPH_RANKER_HIT structure. The occurrences within each document are guaranteed to be passed in ascending order of hit->hit_pos values.
- XXX_finalize() is called once for each matched document when there are no more keyword occurrences. It must return the WEIGHT() value. This function is the only mandatory one.
- XXX_deinit() is invoked once per query, at the very end.

A rough sketch of this workflow is shown below.
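The sketch simply counts keyword occurrences per document and returns that count as the weight. The function prototypes (how the user data pointer and the SPH_RANKER_INIT / SPH_RANKER_HIT structures are passed) are assumptions for illustration only; the authoritative declarations are in sphinxudf.h and the plugins documentation.

```c
// Rough sketch of a ranker plugin; prototypes are assumed, not authoritative.
#include <stdlib.h>
#include "sphinxudf.h"

int myranker_init ( void ** userdata, SPH_RANKER_INIT * info, char * error_message )
{
    *userdata = calloc ( 1, sizeof(int) ); // per-document hit counter
    return *userdata ? 0 : 1;              // non-zero means failure
}

void myranker_update ( void * userdata, SPH_RANKER_HIT * hit )
{
    (*(int*)userdata)++; // one more keyword occurrence in the current document
}

unsigned int myranker_finalize ( void * userdata, int match_weight )
{
    int hits = *(int*)userdata;
    *(int*)userdata = 0;       // reset the counter for the next document
    return (unsigned int)hits; // this becomes the WEIGHT() value
}

void myranker_deinit ( void * userdata )
{
    free ( userdata );
}
```

Such a plugin would be loaded with CREATE PLUGIN ... TYPE 'ranker' and selected per query with OPTION ranker=..., as shown earlier in this section.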
Token filter plugins allow you to implement a custom tokenizer that creates tokens according to custom rules. There are two types: index-time and query-time token filters. In the text processing pipeline, token filters run after the base tokenizer processing occurs (which processes the text from fields or queries and creates tokens out of them).

The index-time tokenizer is created by indexer when indexing source data into a table, or by an RT table when processing INSERT or REPLACE statements.
A plugin is declared as library name:plugin name:optional string of settings. The init functions of the plugin can accept arbitrary settings passed as a string in the format option1=value1;option2=value2;...
Example:
index_token_filter = my_lib.so:email_process:field=email;split=.io
The call workflow for index-time token filter is as follows:
- XXX_init() gets called right after indexer creates the token filter, first with an empty fields list and then, after indexer gets the table schema, with the actual fields list. It must return zero for successful initialization or an error description otherwise.
- XXX_begin_document gets called only for RT table INSERT/REPLACE, for every document. It must return zero for a successful call or an error description otherwise. Using OPTION token_filter_options, additional parameters/settings can be passed to the function:
  ```sql
  INSERT INTO rt (id, title) VALUES (1, 'some text corp@space.io') OPTION token_filter_options='.io'
  ```
- XXX_begin_field gets called once for each field prior to processing the field with the base tokenizer, with the field number as its parameter.
- XXX_push_token gets called once for each new token produced by the base tokenizer, with the source token as its parameter. It must return the token, the count of extra tokens made by the token filter, and the delta position for the token.
- XXX_get_extra_token gets called multiple times in case XXX_push_token reports extra tokens. It must return the token and the delta position for that extra token.
- XXX_end_field gets called once right after the source tokens from the current field are processed.
- XXX_deinit gets called at the very end of indexing.

The following functions are mandatory to be defined: XXX_begin_document, XXX_push_token, and XXX_get_extra_token.
The query-time tokenizer is created at search time for each table involved, every time the full-text part of the query is processed.
The call workflow for query-time token filter is as follows:
- XXX_init() gets called once per table prior to parsing the query, with two parameters: the max token length and the string set by the token_filter option:
  ```sql
  SELECT * FROM index WHERE MATCH ('test') OPTION token_filter='my_lib.so:query_email_process:io'
  ```
- XXX_push_token() gets called once for each new token produced by the base tokenizer, with parameters: the token produced by the base tokenizer, a pointer to the raw token in the source query string, and the raw token length. It must return the token and the delta position for the token.
- XXX_pre_morph() gets called once for the token right before it gets passed to the morphology processor, with a reference to the token and a stopword flag. It might set the stopword flag to mark the token as a stopword.
- XXX_post_morph() gets called once for the token after it is processed by the morphology processor, with a reference to the token and a stopword flag. It might set the stopword flag to mark the token as a stopword. It must return a flag; a non-zero value means the token prior to morphology processing should be used.
- XXX_deinit() gets called at the very end of query processing.

The absence of any of these functions is tolerated.
indextool is a helpful utility that extracts various information about a physical table, excluding template or distributed tables. Here's the general syntax for utilizing indextool:
indextool <command> [options]
These options are applicable to all commands:
- --config <file> (-c <file> for short) lets you override the default configuration file names.
- --quiet (-q for short) suppresses the output of banners and such by indextool.
- --help (-h for short) displays all parameters available in your specific build of indextool.
- -v displays the version information of your specific indextool build.

Here are the available commands:

- --checkconfig loads and verifies the config file, checking its validity and for any syntax errors.
- --buildidf DICTFILE1 [DICTFILE2 ...] --out IDFILE constructs an IDF file from one or more dictionary dumps (refer to --dumpdict). The additional parameter --skip-uniq will omit unique words (df=1).
- --build-infixes TABLENAME generates infixes for a pre-existing dict=keywords table (updates .sph, .spi in place). Use this option for legacy table files that already use dict=keywords but now require infix search support; updating the table files with indextool may be simpler or quicker than recreating them from scratch with indexer.
- --dumpheader FILENAME.sph promptly dumps the given table header file without disturbing any other table files or even the config file. The report offers a detailed view of all the table settings, especially the complete attribute and field list.
- --dumpconfig FILENAME.sph extracts the table definition from the specified table header file in an (almost) manticore.conf file-compliant format.
- --dumpheader TABLENAME dumps the table header by table name while searching for the header path in the config file.
- --dumpdict TABLENAME dumps the dictionary. An extra -stats switch will add the total document count to the dictionary dump. This is necessary for dictionary files used in IDF file creation.
- --dumpdocids TABLENAME dumps document IDs by table name.
- --dumphitlist TABLENAME KEYWORD dumps all instances (occurrences) of a specified keyword in a given table, with the keyword defined as text.
- --dumphitlist TABLENAME --wordid ID dumps all instances (occurrences) of a specific keyword in a given table, with the keyword represented as an internal numeric ID.
- --docextract TBL DOCID executes a standard table check pass of the entire dictionary/docs/hits, and gathers all the words and hits associated with the requested document. Subsequently, all the words are arranged according to their fields and positions, and the result is printed, grouped by field.
- --fold TABLENAME OPTFILE helps understand how the tokenizer processes input. You can supply indextool with text from a file, if specified, or from stdin otherwise. The output will replace separators with spaces (based on your charset_table settings) and convert letters in words to lowercase.
- --htmlstrip TABLENAME applies the HTML stripper settings for a specified table to filter stdin, and sends the filtering results to stdout. Be aware that the settings will be fetched from manticore.conf, and not from the table header.
- --mergeidf NODE1.idf [NODE2.idf ...] --out GLOBAL.idf combines multiple .idf files into a single one. The extra parameter --skip-uniq will ignore unique words (df=1).
- --morph TABLENAME applies morphology to the given stdin and directs the result to stdout.
- --check TABLENAME evaluates the table data files for consistency errors that could be caused by bugs in indexer or hardware faults. --check is also functional on RT tables, RAM, and disk chunks. Additional options:
  - --check-id-dups checks for duplicate ids
  - --check-disk-chunk CHUNK_NAME checks only a specific disk chunk of an RT table. The argument is the numeric extension of the RT table's disk chunk to be checked.
  - --strip-path removes the path names from all file names referred to from the table (stopwords, wordforms, exceptions, etc.). This is helpful when verifying tables built on a different machine with possibly varying path layouts.
- --rotate is only compatible with --check and determines whether to check the table waiting for rotation, i.e., with a .new extension. This is useful when you wish to validate your table before actually putting it into use.
- --apply-killlists loads and applies kill-lists for all tables listed in the config file. Changes are saved in .SPM files. Kill-list files (.SPK) are removed. This can be handy if you want to shift the application of kill-lists from server startup to the indexing stage.

The spelldump command is designed to retrieve the contents from a dictionary file that employs the ispell or MySpell format. This can be handy when you need to compile word lists for wordforms, as it generates all possible forms for you.
Here's the general syntax:
spelldump [options] <dictionary> <affix> [result] [locale-name]
The primary parameters are the main file and the affix file of the dictionary. Typically, these are named as [language-prefix].dict and [language-prefix].aff, respectively. You can find these files in most standard Linux distributions or from numerous online sources.
The [result] parameter is where the extracted dictionary data will be stored, and [locale-name] is the parameter used to specify the locale details of your choice.
There's an optional -c [file] option as well. This option allows you to specify a file for case conversion details.
Here are some usage examples:
spelldump en.dict en.aff
spelldump ru.dict ru.aff ru.txt ru_RU.CP1251
spelldump ru.dict ru.aff ru.txt .1251
The resulting file will list all the words from the dictionary, arranged alphabetically and formatted like a wordforms file. You can then modify this file as per your specific requirements. Here's a sample of what the output file might look like:
zone > zone
zoned > zoned
zoning > zoning
The wordbreaker tool is designed to deconstruct compound words, a common feature in URLs, into their individual components. For instance, it can dissect "lordoftherings" into four separate words or break down http://manofsteel.warnerbros.com into "man of steel warner bros". This ability enhances search functionality by eliminating the need for prefixes or infixes. To illustrate, a search for "sphinx" wouldn't yield "sphinxsearch" in the results. However, if you apply wordbreaker to disassemble the compound word and index the detached elements, a search will be successful without the file size expansion associated with prefix or infix usage in full-text indexing.
Here are some examples of how to use wordbreaker:
echo manofsteel | bin/wordbreaker -dict dict.txt split
man of steel
The -dict dictionary file is used to separate the input stream into individual words. If no dictionary file is specified, Wordbreaker will look for a file named wordbreaker-dict.txt in the current working directory. (Ensure that the dictionary file matches the language of the compound word you're working with.) The split command breaks words from the standard input and sends the results to the standard output. The test and bench commands are also available to assess the splitting quality and measure the performance of the splitting function, respectively.
Wordbreaker uses a dictionary to identify individual substrings within a given string. To distinguish between multiple potential splits, it considers the relative frequency of each word in the dictionary. A higher frequency indicates a higher likelihood for a word split. To generate a file of this nature, you can use the indexer tool:
indexer --buildstops dict.txt 100000 --buildfreqs myindex -c /path/to/manticore.conf
which will produce a text file named dict.txt that contains the 100,000 most frequently occurring words from myindex, along with their respective counts. Since this output file is a simple text document, you have the flexibility to manually edit it whenever needed. Feel free to add or remove words as required.
The Manticore Search API is documented using the OpenAPI specification, which can be used to generate client SDKs. The machine-readable YAML file is available at https://raw.githubusercontent.com/manticoresoftware/openapi/master/manticore.yml
You can also view the specification visualized with the online Swagger Editor here.
At Manticore, we gather various anonymized metrics to enhance the quality of our products, including Manticore Search. By analyzing this data, we can not only improve the overall performance of our product but also identify which features would be most beneficial to prioritize in order to provide even more value to our users. The telemetry system operates on a separate thread in a non-blocking mode, taking snapshots and sending them once every few minutes.
We take your privacy seriously, and you can rest assured that all metrics are completely anonymous and no sensitive information is transmitted. However, if you still wish to disable telemetry, you have the option to do so by:
- setting the environment variable TELEMETRY=0
- setting telemetry = 0 in the searchd section of your configuration file

Here is a list of all the metrics we collect:
The ⏱️ symbol indicates that the metric is collected periodically, as opposed to other metrics which are collected based on specific events.
| Metric | Description |
|---|---|
| invocation | Sent when Manticore Buddy is launched |
| plugin_* | Indicates that the plugin with a given name was executed, plugin_backup for backup execution, for example |
| command_* | ⏱️ All metrics with this prefix are sent from the show status query of the Manticore daemon |
| uptime | ⏱️ The uptime of the Manticore Search daemon |
| workers_total | ⏱️ The number of workers used by Manticore |
| cluster_count | ⏱️ How many clusters this node handles |
| cluster_size | ⏱️ How many nodes are in all clusters |
| table_*_count | ⏱️ The number of tables created for each type: plain, percolate, rt, or distributed |
| *_field_*_count | ⏱️ The count for each field type for tables with rt and percolate types |
| columnar | ⏱️ Indicates that the Columnar library was used |
| columnar_field_count | ⏱️ The number of fields that use the Columnar library |
The Manticore backup tool sends anonymized metrics to the Manticore metrics server by default in order to help improve the product. If you don't want to send telemetry, you can disable it by running the tool with the --disable-metric flag or by setting the environment variable TELEMETRY=0.
The following is a list of all collected metrics:
| Metric | Description |
|---|---|
| invocation | Sent when backup was initiated |
| failed | Sent in case of failed backup |
| done | Sent when backup/restore is successful |
| arg_* | The arguments used to run the tool (excluding index names, etc.) |
| backup_store_versions_fails | Indicates failure in saving Manticore version in the backup |
| backup_table_count | Total number of backed up tables |
| backup_no_permissions | Failed backup due to insufficient permissions to the destination directory |
| backup_total_size | Total size of the full backup |
| backup_time | Duration of the backup |
| restore_searchd_running | Failed to run restore process due to searchd already running |
| restore_no_config_file | No config file found in the backup during restore |
| restore_time | Duration of the restore |
| fsync_time | Duration of fsync |
| restore_target_exists | Occurs when a folder or index already exists in the destination folder for restore |
| terminations | Indicates that the process was terminated |
| signal_* | The signal used to terminate the process |
| tables | Number of tables in Manticore |
| config_unreachable | Specified configuration file does not exist |
| config_data_dir_missing | Failed to parse data_dir from the specified configuration file |
| config_data_dir_is_relative | data_dir path in the Manticore instance's configuration file is relative |
Each metric comes with the following labels:
| Label | Description |
|---|---|
| collector | buddy. Indicates that this metric is collected through Manticore Buddy |
| os_name | Name of the operating system |
| os_release_name | Name from /etc/os-release if present, or unknown |
| os_release_version | Version from /etc/os-release if present, or unknown |
| dockerized | Whether it's run inside a Docker environment |
| official_docker | In the case of Docker, a flag that shows whether the official image is used |
| machine_id | Server identifier (the content of /etc/machine-id on Linux) |
| arch | Architecture of the machine we run on |
| manticore_version | Version of Manticore |
| columnar_version | Version of the Columnar library if it is installed |
| secondary_version | Version of the secondary library if the Columnar library is installed |
| knn_version | Version of the KNN library if the Columnar library is installed |
| buddy_version | Version of Manticore Buddy |
While 6.3.0 is being prepared for release, use the dev version which includes all the below changes - https://mnt.cr/dev/nightly
timezone - Timezone used by date/time-related functions
Commit 30e7 Added range, histogram, date_range, and date_histogram aggregates to the HTTP interface and similar expressions into SQL.
.spa (scalar attrs): 256KB -> 8MB; .spb (blob attrs): 256KB -> 8MB; .spc (columnar attrs): 1MB, no change; .spds (docstore): 256KB -> 8MB; .spidx (secondary indexes): 256KB buffer -> 128MB memory limit; .spi (dictionary): 256KB -> 16MB; .spd (doclists): 8MB, no change; .spp (hitlists): 8MB, no change; .spe (skiplists): 256KB -> 8MBlibgalera_smm.so from MySQL 5.x)._rate to _rpsindex to table in error messages; fixed bison parser error message fixupmanticore.tbl as table name.agent_connect_timeout and agent_query_timeout) for create distributed table statement.searchd.expansion_limit.SHOW STATUS@@system.sessions.CALL PQ with large packets.log_level=debug is set@timestamp column as timestampRuntimeDirectory--new-cluster, using the tool manticore_new_cluster in Linux.Read about restarting a cluster for more details.
⚠️Issue #1763 HTTP API endpoint aliases /json/* have been deprecated
plugin_dir anymoregcache.page_size for replication clusters without tables or with empty tables; also fixed saving and loading of the Galera options.searchd.agent_* but with different defaults.show variables.debugv verbosity level.index_exact_words between indexing and loading the table to the daemon.=term of full-text query with the morphology_skip_fields field.data_dir affect the current work directory on daemon start.agent_query_timeout being replaced by the default query option agent_query_timeout.packedfactors() with multiple values per match.min_prefix_len / min_infix_len.SPH_EXTNODE_STACK_SIZE value.FACET error when querying a distributed table with agent and local tables.CREATE TABLE wasn't failing in case of a missing wordforms file.SPH_SORT_ATTR_DESC and SPH_SORT_ATTR_ASC.Expect: 100-continue HTTP header for curl requests to Buddy./search.ALTER CLUSTER ADD and JOIN CLUSTER operations to wait for each other, preventing a race condition where ALTER adds a table to the cluster while the donor sends tables to the joiner node.UNFREEZE wasn't working in some casessignal 11 when inserting dataFREEZE counter to avoid freeze/unfreeze issues.max_query_time could be not working in some cases.SecondaryIndex CBO hintexpansion_limit to slice final result set for call keywords from multiple disk chunks or RAM chunks.Issue #97 Set VIP HTTP port as default when available
Various improvements: improved versions check and streaming ZSTD decompression; added user prompts for version mismatches during restore; fixed incorrect prompting behavior for different versions on restore; enhanced decompression logic to read directly from the stream rather than into working memory; added --force flag
Commit 3b35 Added backup version display after Manticore search start to identify issues at this stage
HAVING.expr ranker on using columnar attribute.manticore.conf.shReleased: August 23rd 2023
Version 6.2.12 continues the 6.2 series and addresses issues discovered after the release of 6.2.0.
TimeoutStartSec from infinity to 0 for better compatibility with Centos 7.searchdreplication.cpp: beggining -> beginning.Thd_t build issue on Windows related to atomic copy restrictions.ColumnarScan.AF_INET error in the test./bulk endpoints in the manual.Released: August 4th 2023
mysqldumpWe've started using GitHub workflows, making it simpler for contributors to utilize the same Continuous Integration (CI) process that the core team applies when preparing packages. All jobs can be run on GitHub-hosted runners, which facilitates seamless testing of changes in your fork of Manticore Search.
pseudo_sharding has been adjusted to be limited to the number of free threads. This update considerably enhances the throughput performance./json/pq HTTP endpoint.upper() and lower().count(*) queries, a precalculated value is now returned.SELECT for making arbitrary calculations and displaying @@sysvars. Unlike before, you are no longer limited to just one calculation. Therefore, queries like select user(), database(), @@version_comment, version(), 1+1 as a limit 10 will return all the columns. Note that the optional 'limit' will always be ignored.CREATE DATABASE stub query.ALTER TABLE table REBUILD SECONDARY, secondary indexes are now always rebuilt, even if attributes weren't updated.SELECT DATABASE() command. However, it will always return Manticore. This addition is crucial for integrations with various MySQL tools./cli_json endpoint to function as the previous /cli.thread_stack can now be altered during runtime using the SET statement. Both session-local and daemon-wide variants are available. Current values can be accessed in the show variables output.SHOW STATUS command.DESC and SHOW CREATE TABLE now match that of SELECT * FROM.P01) during various errors. This enhancement aids in identifying which parser caused an error and also obscures non-essential internal details.sentence to show the entire sentencestrftime() function./bulk endpoint reports information regarding the number of processed and non-processed strings (documents) in case of an error.CREATE TABLE operation can run at a time.Get call, replacing the previous two-step AdvanceTo + Get calls to retrieve a value.CheckReplaceEntry call was removed from the group sorter to expedite the calculation of aggregate functions.CREATE TABLE options read_buffer_docs and read_buffer_hits now support k/m/g syntax.apt/yum install manticore-language-packs. On macOS, use the command brew install manticoresoftware/tap/manticore-language-packs.SHOW CREATE TABLE and DESC operations.INSERT queries, new INSERT queries will fail until enough disk space becomes available./bulk endpoint now processes empty lines as a commit command. More info here.count(*) is used with a single filter, queries now leverage precalculated data from secondary indexes when available, substantially speeding up query times./*+ SecondaryIndex(uid) */. Please note that the old syntax is no longer supported.@ in table names has been disallowed to prevent syntax conflicts.indexed and attribute are now regarded as a single field during INSERT, DESC, and ALTER operations.manticore.json config..sph files could be corrupted ALTER. Fixed.pre_commit error occurring when replace is replicated from multiple master nodes.pseudo_sharding was disabled.show index status command has been modified and now varies depending on the type of index in use.expand_keywords option.SNIPPETS() was called.not_terms_only_allowed option to RT index with killed documents.FEDERATED engine with aggregate.rt_attr_json column was incompatible with columnar storage.ignore_chars.--dumpdocids command.morphology_skip_fields.max_packet_size check for replication commands between nodes. Additionally, the latest cluster error has been added to the status display.MANTICORE_BUDDY_TIMEOUT (default 3 seconds) to control the daemon's wait duration for a buddy message at startup.SHOW CREATE TABLE.SNIPPET() function.all()/any() is logged.Released: March 15 2023
Added handling of bulk requests in Elasticsearch-like format.
Buddy commit ce90 Log Buddy version on Manticore start.
/pq HTTP endpoint to be an alias of the /json/pq HTTP endpoint.Released: Feb 10 2023
Released: Feb 7 2023
Starting with this release, Manticore Search comes with Manticore Buddy, a sidecar daemon written in PHP that handles high-level functionality that does not require super low latency or high throughput. Manticore Buddy operates behind the scenes, and you may not even realize it is running. Although it is invisible to the end user, it was a significant challenge to make Manticore Buddy easily installable and compatible with the main C++-based daemon. This major change will allow the team to develop a wide range of new high-level features, such as shards orchestration, access control and authentication, and various integrations like mysqldump, DBeaver, Grafana mysql connector. For now it already handles SHOW QUERIES, BACKUP and Auto schema.
This release also includes more than 130 bug fixes and numerous features, many of which can be considered major.
SET GLOBAL ES_COMPAT=off.Commit 2b95 Added CBO hints for fine-tuning its behaviour.
Telemetry: we are excited to announce the addition of telemetry in this release. This feature allows us to collect anonymous and depersonalized metrics that will help us improve the performance and user experience of our product. Rest assured, all data collected is completely anonymous and will not be linked to any personal information. This feature can be easily turned off in the settings if desired.
when you did UPDATE (i.e. in-place update, not replace) of an attribute in the index
Issue #821 New tool manticore-backup for backing up and restoring Manticore instance
KILL to kill a long-running SELECT.max_matches for aggregation queries to increase accuracy and lower response time.accurate_aggregation and max_matches_increase_threshold for controlled aggregation accuracy.index. To reduce confusion, we are renaming the latter to "table". The following SQL/command line commands are affected by this change. Their old versions are deprecated, but still functional:index <table name> => table <table name>,searchd -i / --index => searchd -t / --table,SHOW INDEX STATUS => SHOW TABLE STATUS,SHOW INDEX SETTINGS => SHOW TABLE SETTINGS,FLUSH RTINDEX => FLUSH TABLE,OPTIMIZE INDEX => OPTIMIZE TABLE,ATTACH TABLE plain TO RTINDEX rt => ATTACH TABLE plain TO TABLE rt,RELOAD INDEX => RELOAD TABLE,RELOAD INDEXES => RELOAD TABLES.We are not planning to make the old forms obsolete, but to ensure compatibility with the documentation, we recommend changing the names in your application. What will be changed in a future release is the "index" to "table" rename in the output of various SQL and JSON commands.
searchd.secondary_indexes = 1 in your configuration file, be aware that the new Manticore version will skip loading the tables that have secondary indexes. It's recommended to:searchd.secondary_indexes to 0 in the configuration file.ALTER TABLE <table name> REBUILD SECONDARY for each index to rebuild secondary indexes.If you are running a replication cluster, you'll need to run ALTER TABLE <table name> REBUILD SECONDARY on all the nodes or follow this instruction with just change: run the ALTER .. REBUILD SECONDARY instead of the OPTIMIZE.
/var/lib/manticore/binlog/ except for binlog.meta after stopping the previous instance.SHOW SETTINGS: you can now see the settings from the configuration file from inside Manticore.dump_corrupt_meta enables dumping a corrupted table meta data to log in case searchd can't load the index.DEBUG META can show max_matches and pseudo sharding statistics.--new-cluster (run tool manticore_new_cluster in Linux).select attr, count(*) from plain_index (w/o filtering) are now faster in case you are using MCL.brew install manticoresoftware/manticore/manticoresearch manticoresoftware/manticore/manticore-extra.textbinlog_flush = 1 has been broken all the time since Sphinx. Fixed.got exception while reading ist stream: mkstemp(./gmb_pF6TJi) failed: 13 (Permission denied) if the searchd was started from a directory it can't write to.Released: May 30th 2022
Released: May 18th 2022
secondary_indexes = 1 either in your configuration file or using SET GLOBAL. The new functionality is supported in all operating systems except old Debian Stretch and Ubuntu Xenial.a=1 and (b=2 or c=3) in JSON: must (AND), should (OR) and must_not (NOT) worked only on the highest level. Now they can be nested.Content-Length is unnecessary). On the server side, Manticore now always processes incoming HTTP data in a streaming manner, without waiting for the entire batch to be transferred as before, which:allows you to bypass max_packet_size and transfer batches much larger than the maximum allowed value of max_packet_size (128MB), for example, 1GB at a time.
#719 HTTP interface support of 100 Continue: now you can transfer large batches from curl (including curl libraries used by various programming languages) which by default does Expect: 100-continue and waits some time before actually sending the batch. Previously you had to add Expect: header, now it's not needed.
pseudo_sharding = 0 to section searchd of your Manticore configuration file.select * from <columnar table> are now much faster than previously, especially if there are many fields in the schema.total_found in SHOW META and hits.total in JSON output. It is now only accurate in case you see total_relation: eq while total_relation: gte means the actual number of matching documents is greater than the total_found value you've got. To retain the previous behaviour you can use search option cutoff=0, which makes total_relation always eq.stored_fields = (empty value) to make all fields non-stored (i.e. revert to the previous behaviour)..meta, .sph) were in binary format, now it's just json. The new Manticore version will convert older indexes automatically, but:WARNING: ... syntax error, unexpected TOK_IDENTyou won't be able to run the index with previous Manticore versions, make sure you have a backup
⚠️ BREAKING CHANGE: Session state support with help of HTTP keep-alive. This makes HTTP stateful when the client supports it too. For example, using the new /cli endpoint and HTTP keep-alive (which is on by default in all browsers) you can call SHOW META after SELECT and it will work the same way it works via mysql. Note, previously Connection: keep-alive HTTP header was supported too, but it only caused reusing the same connection. Since this version it also makes the session stateful.
columnar_attrs = * to define all your attributes as columnar in the plain mode which is useful in case the list is long.--new-cluster (run tool manticore_new_cluster in Linux).read about restarting a cluster for more details.
Replication improvements:
Improved logging
Security improvement: Manticore now listens on 127.0.0.1 instead of 0.0.0.0 in case no listen at all is specified in config. Even though in the default configuration which is shipped with Manticore Search the listen setting is specified and it's not typical to have a configuration with no listen at all, it's still possible. Previously Manticore would listen on 0.0.0.0 which is not secure, now it listens on 127.0.0.1 which is usually not exposed to the Internet.
AVG() accuracy: previously Manticore used float internally for aggregations, now it uses double which increases the accuracy significantly.DEBUG malloc_stats support for jemalloc.sphinxql by default. If you are used to plain format you need to add query_log_format = plain to your configuration file.max_connections limit, which could cause "maxed out" error for non-VIP connections. Now VIP connections are not counted towards the limit. Current number of VIP connections can be also seen in SHOW STATUS and status./sql?mode=raw now requires escaping and returns an array./bulk INSERT/REPLACE/DELETE requests:now the whole batch is considered a single transaction, which returns a single response
⚠️ Search options low_priority and boolean_simplify now require a value (0/1): previously you could do SELECT ... OPTION low_priority, boolean_simplify, now you need to do SELECT ... OPTION low_priority=1, boolean_simplify=1.
query_log_format=sphinxql. Previously only full-text part was logged, now it's logged as is.yum remove manticore*Debian and Ubuntu: apt remove manticore*
New deb/rpm packages structure. Previous versions provided:
- manticore-server with searchd (main search daemon) and all needed for it
- manticore-tools with indexer and indextool
- manticore including everything
- manticore-all RPM as a meta package referring to manticore-server and manticore-tools
- manticore - deb/rpm meta package which installs all the above as dependencies
- manticore-server-core - searchd and everything to run it alone
- manticore-server - systemd files and other supplementary scripts
- manticore-tools - indexer, indextool and other tools
- manticore-common - default configuration file, default data directory, default stopwords
- manticore-icudata, manticore-dev, manticore-converter didn't change much
- .tgz bundle which includes all the packages
application/x-ndjsonranker could be specified twice in query log
root@perf3 ~ # mysql -P9306 -h0 -e "drop table if exists pq; create table pq (f text, f2 text, j json, s string) type='percolate';"; date; for m in `seq 1 1000`; do (echo -n "insert into pq (id,query,filters,tags) values "; for n in `seq 1 1000`; do echo -n "(0,'@f (cat | ( angry dog ) | (cute mouse)) @f2 def', 'j.json.language=\"en\"', '{\"tag1\":\"tag1\",\"tag2\":\"tag2\"}')"; [ $n != 1000 ] && echo -n ","; done; echo ";")|mysql -P9306 -h0; done; date; mysql -P9306 -h0 -e "select count(*) from pq"
Wed Dec 22 10:24:30 AM CET 2021
Wed Dec 22 10:25:18 AM CET 2021
+----------+
| count(*) |
+----------+
| 1000000 |
+----------+
root@perf3 ~ # date; (echo "begin;"; for offset in `seq 0 10000 30000`; do n=0; echo "replace into pq (id,query,filters,tags) values "; for id in `mysql -P9306 -h0 -NB -e "select id from pq limit $offset, 10000 option max_matches=1000000"`; do echo "($id,'@f (tiger | ( angry bear ) | (cute panda)) @f2 def', 'j.json.language=\"de\"', '{\"tag1\":\"tag1\",\"tag2\":\"tag2\"}')"; n=$((n+1)); [ $n != 10000 ] && echo -n ","; done; echo ";"; done; echo "commit;") > /tmp/replace.sql; date
Wed Dec 22 10:26:23 AM CET 2021
Wed Dec 22 10:26:27 AM CET 2021
root@perf3 ~ # time mysql -P9306 -h0 < /tmp/replace.sql
real 6m46.195s
user 0m0.035s
sys 0m0.008s
root@perf3 ~ # mysql -P9306 -h0 -e "drop table if exists pq; create table pq (f text, f2 text, j json, s string) type='percolate';"; date; for m in `seq 1 1000`; do (echo -n "insert into pq (id,query,filters,tags) values "; for n in `seq 1 1000`; do echo -n "(0,'@f (cat | ( angry dog ) | (cute mouse)) @f2 def', 'j.json.language=\"en\"', '{\"tag1\":\"tag1\",\"tag2\":\"tag2\"}')"; [ $n != 1000 ] && echo -n ","; done; echo ";")|mysql -P9306 -h0; done; date; mysql -P9306 -h0 -e "select count(*) from pq"
Wed Dec 22 10:06:38 AM CET 2021
Wed Dec 22 10:07:12 AM CET 2021
+----------+
| count(*) |
+----------+
| 1000000 |
+----------+
root@perf3 ~ # date; (echo "begin;"; for offset in `seq 0 10000 990000`; do n=0; echo "replace into pq (id,query,filters,tags) values "; for id in `mysql -P9306 -h0 -NB -e "select id from pq limit $offset, 10000 option max_matches=1000000"`; do echo "($id,'@f (tiger | ( angry bear ) | (cute panda)) @f2 def', 'j.json.language=\"de\"', '{\"tag1\":\"tag1\",\"tag2\":\"tag2\"}')"; n=$((n+1)); [ $n != 10000 ] && echo -n ","; done; echo ";"; done; echo "commit;") > /tmp/replace.sql; date
Wed Dec 22 10:12:31 AM CET 2021
Wed Dec 22 10:14:00 AM CET 2021
root@perf3 ~ # time mysql -P9306 -h0 < /tmp/replace.sql
real 0m23.248s
user 0m0.891s
sys 0m0.047s
searchd. It's useful when you want to limit the RT chunks count in all your indexes to a particular number globally.YEAR() and other timestamp functions.rt_mem_limit of data before saving a new disk chunk to disk, and while saving was still collecting up to 10% more (aka double-buffer) to minimize possible insert suspension. If that limit was also exhausted, adding new documents was blocked until the disk chunk was fully saved to disk. The new adaptive limit is built on the fact that we have auto-optimize now, so it's not a big deal if disk chunks do not fully respect rt_mem_limit and start flushing a disk chunk earlier. So, now we collect up to 50% of rt_mem_limit and save that as a disk chunk. Upon saving we look at the statistics (how much we've saved, how many new documents have arrived while saving) and recalculate the initial rate which will be used next time. For example, if we saved 90 million documents, and another 10 million docs arrived while saving, the rate is 90%, so we know that next time we can collect up to 90% of rt_mem_limit before starting flushing another disk chunk. The rate value is calculated automatically from 33.3% to 95%.indexer -v and --version. Previously you could still see indexer's version, but -v/--version were not supported.MANTICORE_TRACK_RT_ERRORS useful for debugging RT segments corruption./var/lib/manticore/binlog/ except binlog.meta after stopping the previous instance.show threads option format=all. It shows stack of some task info tickets, most useful for profiling needs, so if you are parsing show threads output be aware of the new column.searchd.workers was obsoleted since 3.5.0, now it's deprecated, if you still have it in your configuration file it will trigger a warning on start. Manticore Search will start, but with a warning.PDO::ATTR_EMULATE_PREPARESindextool --check could crashINSERT, REPLACE, DELETE, OPTIMIZEALTERindextool --checkname text, email string, description text, age int, active bit(1) (default rt_mem_limit, batch size 25000, 16 concurrent insert workers, 16 million docs inserted overall). In 4.0.2 the same concurrency/batch/count gives 357K docs per second*.CPU cores lower response time for non-full-text search queries. Note it can easily occupy all existing CPU cores, so if you care not only about latency, but throughput too - use it with caution.time curl -X POST -d '{"update":{"index":"idx","id":4611686018427387905,"doc":{"mode":0}}}' -H "Content-Type: application/x-ndjson" http://127.0.0.1:6358/json/bulk
real 0m43.783s
user 0m0.008s
sys 0m0.007s
time curl -X POST -d '{"update":{"index":"idx","id":4611686018427387905,"doc":{"mode":0}}}' -H "Content-Type: application/x-ndjson" http://127.0.0.1:6358/json/bulk
real 0m0.006s
user 0m0.004s
sys 0m0.001s
--replay-flags=ignore-trx-errors and --replay-flags=ignore-all-errors so one can still start searchd if the binlog is corruptedcharset_table's default value changes from 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451 to non_cjkOPTIMIZE happens automatically. If you don't need it make sure to set auto_optimize=0 in section searchd in the configuration fileondisk_attrs_default were deprecated, now they are removedtotal in SHOW META, but not total_found which is the actual number of found documents./var/lib/manticore/binlog/ (only binlog.meta should be in the directory)--new-cluster (run tool manticore_new_cluster in Linux).ERROR 1064 (42000): invalid GTID, (null), the donor could become unresponsive while another node was joiningindextool --help doesn't display parameter --rotatecommand_insert, command_replace and others were showing wrong metricscharset_table for a plain index had a wrong default valueSELECT * FROM pq ORDER BY id desc LIMIT 1000 , 100 OPTION max_matches=1100 was not working previouslyMaintenance release before Manticore 4
manticore_new_cluster [--force] useful for restarting a replication cluster via systemdindexer --mergeblend_mode='trim_all'WHERE json.a = 1DEBUG SPLIT as a prerequisite for automatic sharding/rebalancingindextool --dumpheaderreverse_scan is deprecated. Make sure you don't use this option in your queries since 3.6.0 since they will fail otherwisereverse_scan has been deprecatedindexer --all and have not only plain indexes in the configuration file. Without ignore_non_plain=1 you'll get a warning and a respective exit code.indexer --verbose is deprecated as it never added anything to the indexer outputUSR2 is now to be used instead of USR12* No. of cores) instead of a single one. The optimal number of chunks can be controlled by cutoff option.0.New setting max_threads_per_query sets how many threads a query can use. If the directive is not set, a query can use threads up to the value of threads.
Per SELECT query the number of threads can be limited with OPTION threads=N overriding the global max_threads_per_query.
Percolate indexes can be now be imported with IMPORT TABLE.
/search receives basic support for faceting/grouping by new query node aggs.listen=...:sphinx needs to be explicit set for SphinxSE connections or SphinxAPI clients.killed_documents, killed_rate, disk_mapped_doclists, disk_mapped_cached_doclists, disk_mapped_hitlists and disk_mapped_cached_hitlists.status now outputs Queue\Threads and Tasks\Threads.dist_threads is completely deprecated now, searchd will log a warning if the directive is still used.The official Docker image is now based on Ubuntu 20.04 LTS
Besides the usual manticore package, you can also install Manticore Search by components:
- manticore-server-core - provides searchd, manpage, log dir, API and galera module. It will also install manticore-common as a dependency.
- manticore-server - provides automation scripts for core (init.d, systemd) and the manticore_new_cluster wrapper. It will also install manticore-server-core as a dependency.
- manticore-common - provides config, stopwords, generic docs and skeleton folders (datadir, modules, etc.)
- manticore-tools - provides auxiliary tools (indexer, indextool etc.), their manpages and examples. It will also install manticore-common as a dependency.
- manticore-icudata (RPM) or manticore-icudata-65l (DEB) - provides the ICU data file for icu morphology usage.
- manticore-devel (RPM) or manticore-dev (DEB) - provides dev headers for UDFs.

Highlighting changes: highlight({},'field1, field2') (or highlight in json queries) now applies limits per-field by default; highlight({}, string_attr) or snippet() now applies limits to the whole document; there is a limits_per_field=0 option (1 by default); allow_empty is now 0 by default for highlighting via HTTP JSON.
The same port can now be used for http, https and binary API (to accept connections from a remote Manticore instance). listen = *:mysql is still required for connections via mysql protocol. Manticore now detects automatically the type of client trying to connect to it except for MySQL (due to restrictions of the protocol).
In RT mode a field can now be text and string attribute at the same time - GitHub issue #331.
In plain mode it's called sql_field_string. Now it's available in RT mode for real-time indexes too. You can use it as shown in the example:
```sql
create table t(f string attribute indexed);
insert into t values(0,'abc','abc');
select * from t where match('abc');
+---------------------+------+
| id | f |
+---------------------+------+
| 2810845392541843463 | abc |
+---------------------+------+
1 row in set (0.01 sec)
mysql> select * from t where f='abc';
+---------------------+------+
| id | f |
+---------------------+------+
| 2810845392541843463 | abc |
+---------------------+------+
1 row in set (0.00 sec)
```
status command.in is now available via HTTP JSON interface.expressions in HTTP JSON.rt_mem_limit on the fly in RT mode, i.e. can do ALTER ... rt_mem_limit=<new value>.chinese, japanese and korean.SHOW THREADS output.CALL PQ in SHOW THREADS.SET [GLOBAL] wait_timeout=NUM implemented ,INSERT INTO PQ VALUES() (i.e. without providing column list) previously expected exactly (query, tags) as the values. It's been changed to (id,query,tags,filters). The id can be set to 0 if you want it to be auto-generated.allow_empty=0 is a new default in highlighting via HTTP JSON interface.CREATE TABLE/ALTER TABLE.ram_chunks_count was renamed to ram_chunk_segments_count in SHOW INDEX STATUS.workers is obsolete. There's only one workers mode now.dist_threads is obsolete. All queries are as much parallel as possible now (limited by threads and jobs_queue_size).max_children is obsolete. Use threads to set the number of threads Manticore will use (set to the # of CPU cores by default).queue_max_length is obsolete. Instead of that in case it's really needed use jobs_queue_size to fine-tune internal jobs queue size (unlimited by default)./json/* endpoints are now available w/o /json/, e.g. /search, /insert, /delete, /pq etc.field meaning "full-text field" was renamed to "text" in describe.3.4.2:
```sql
mysql> describe t;
+-------+--------+----------------+
| Field | Type   | Properties     |
+-------+--------+----------------+
| id    | bigint |                |
| f     | field  | indexed stored |
+-------+--------+----------------+
```
3.5.0:
```sql
mysql> describe t;
+-------+--------+----------------+
| Field | Type   | Properties     |
+-------+--------+----------------+
| id    | bigint |                |
| f     | text   | indexed stored |
+-------+--------+----------------+
```
и doesn't map to i in non_cjk charset_table (which is a default) as it affected Russian stemmers and lemmatizers too much.read_timeout. Use network_timeout instead which controls both reading and writing.manticore-bin to manticorecount(*) shows different numbers/json/replace and json/update return id in exponent formhitless_words doesn't work in RT modeALTER RECONFIGURE in rt mode should be disallowedrt_mem_limit gets reset to 128M after searchd restartSHOW CREATE TABLE vs multiple wordform filesSHOW CREATE TABLE doesn't work for PQCREATE TABLE LIKE doesn't work properly for PQCREATE TABLE LIKE infix errorALTER reconfigure corrupts a PQ indexHIGHLIGHT() doesn't higlight in string attributesFACET fails to sort on string attributeCALL PQ returns "Bad JSON objects in strings: 1" when the json is greater than some value.max_xmlpipe2_field = 2M returned warning on 2M field[null] in json attr in centos 7 causes corrupted inserted data/sql HTTP endpoint response is now the same as /json/search responseaccess_plain_attrs, access_blob_attrs, access_doclists, access_hitlistsserver_id for replication setups/search endpointread_buffer, ondisk_attrs, ondisk_attrs_default, mlock are replaced by access_* directivesCmake minimum version is now 3.13. Compiling requires boost and libssl
development libraries.
format=sphinxql prints all queries in SQL formatclone_attrs stageshutdown using DEBUG commanddocs_id option for documents called in CALL PQ.SET wait_timeout (for better ProxySQL compatibility)-DUSE_JEMALLOC=1In this release we've changed internal protocol used by masters and agents to speak with each other. In case you run Manticoresearch in a distributed environment with multiple instances make sure your first upgrade agents, then the masters.
multiplier row when multi-query optimization is usedManticore Search is built using cmake and the minimum gcc version required for compiling is 4.7.2.
Manticore runs under the manticore user and by default uses /var/lib/manticore/ for data, /var/log/manticore/ for logs, and /var/run/manticore/ for the pid file.

Unfortunately, Manticore is not yet 100% bug-free, although the development team is working hard towards that goal. You may encounter some issues from time to time.
It is crucial to report as much information as possible about each bug to fix it effectively.
To fix a bug, either it needs to be reproduced and fixed or its cause needs to be deduced based on the information you provide. To help with this, please follow the instructions below.
Bugs and feature requests are tracked on Github. You are welcome to create a new ticket and describe your bug in detail to save time for both you and the developers.
Updates to the documentation (what you are reading now) are also done on Github.
Manticore Search is written in C++, which is a low-level programming language that allows for direct communication with the computer for faster performance. However, there is a drawback to this approach as in rare cases, it may not be possible to elegantly handle a bug by writing an error to a log and skipping the processing of the command that caused the problem. Instead, the program may crash, resulting in it stopping completely and needing to be restarted.
When Manticore Search crashes, it is important to let the Manticore team know by submitting a bug report on GitHub or through Manticore's professional services in your private helpdesk. The Manticore team requires the following information:
Additionally, it would be helpful if you could do the following:
1. Run gdb to inspect the coredump:
gdb /usr/bin/searchd </path/to/coredump>
2. Find the id of the crashed thread in the coredump file name (provided you have %p in /proc/sys/kernel/core_pattern), e.g. core.work_6.29050.server_name.1637586599 means thread_id=29050
3. In gdb run:
set pagination off
info threads
# find the thread number by its id (e.g. for `LWP 29050` it will be thread number 8)
thread apply all bt
thread <thread number>
bt full
info locals
quit
If Manticore Search hangs, you need to collect some information that may be useful in understanding the cause. Here's how you can do it:
1. Collect the output of show threads option format=all through a VIP port.
2. Collect the output of:
lsof -p `cat /var/run/manticore/searchd.pid`
3. Dump the core:
gcore `cat /var/run/manticore/searchd.pid`
(It will save the dump to the current directory.)
4. Attach gdb to the running searchd:
gdb /usr/bin/searchd `cat /var/run/manticore/searchd.pid`
Note that this will halt your running searchd, but if it's already hanging, it shouldn't be a problem.
5. In gdb run:
set pagination off
info threads
thread apply all bt
quit
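For convenience, steps 2-5 above can be combined into a small script. This is only a sketch: it assumes the default PID file location and that lsof, gcore, and gdb are installed:
#!/bin/bash
# Gather hang diagnostics from a running searchd (sketch; adjust paths as needed)
PID=$(cat /var/run/manticore/searchd.pid)
lsof -p "$PID" > searchd_lsof.txt
gcore -o searchd_core "$PID"          # saves searchd_core.<PID> to the current directory
gdb --batch \
    -ex "set pagination off" \
    -ex "info threads" \
    -ex "thread apply all bt" \
    /usr/bin/searchd "$PID" > searchd_backtrace.txt 2>&1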
For experts: the macros added in this commit can be helpful in debugging.
Make sure searchd is started with the --coredump option. To avoid modifying scripts, you can use the custom startup flags via systemd method described at https://manual.manticoresearch.com/Starting_the_server/Linux#Custom-startup-flags-using-systemd . For example:
[root@srv lib]# systemctl set-environment _ADDITIONAL_SEARCHD_PARAMS='--coredump'
[root@srv lib]# systemctl restart manticore
[root@srv lib]# ps aux|grep searchd
mantico+ 1955 0.0 0.0 61964 1580 ? S 11:02 0:00 /usr/bin/searchd --config /etc/manticoresearch/manticore.conf --coredump
mantico+ 1956 0.6 0.0 392744 2664 ? Sl 11:02 0:00 /usr/bin/searchd --config /etc/manticoresearch/manticore.conf --coredump
Make sure /proc/sys/kernel/core_pattern is not empty. This is the location where the core dumps will be saved. To save core dumps to a file such as core.searchd.1773.centos-4gb-hel1-1.1636454937, run the following command:
echo "/cores/core.%e.%p.%h.%t" > /proc/sys/kernel/core_pattern
Make sure searchd is running with ulimit -c unlimited. If you start Manticore using systemctl, it will automatically set the limit to infinity, as indicated by the following line in the manticore.service file:
[root@srv lib]# grep CORE /lib/systemd/system/manticore.service
LimitCORE=infinity
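Before waiting for the next crash, it may be worth verifying that all three prerequisites are in place. A quick check could look like this (a sketch; the service name assumes the systemd unit shown above):
# Verify that core dumps can actually be produced
cat /proc/sys/kernel/core_pattern                    # should not be empty
systemctl show manticore --property=LimitCORE        # should report LimitCORE=infinity
ps aux | grep '[s]earchd' | grep -- --coredump       # searchd should be running with --coredump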
Manticore Search and Manticore Columnar Library are written in C++, which results in compiled binary files that execute optimally on your operating system. However, when running a binary, your system does not have full access to the names of variables, functions, methods, and classes. This information is provided in separate "debuginfo" or "symbol packages."
Debug symbols are essential for troubleshooting and debugging, as they allow you to visualize the system state when it crashed, including the names of functions. Manticore Search provides a backtrace in the searchd log and generates a coredump if run with the --coredump flag. Without symbols, all you will see is internal offsets, making it difficult or impossible to decode the cause of the crash. If you need to make a bug report about a crash, the Manticore team will often require debug symbols to assist you.
To install Manticore Search/Manticore Columnar Library debug symbols, you will need to install the *debuginfo* package for CentOS, the *dbgsym* package for Ubuntu and Debian, or the *dbgsymbols* package for Windows and macOS. These packages should be the same version as the installed Manticore. For example, if you installed Manticore Search on CentOS 8 from the package https://repo.manticoresearch.com/repository/manticoresearch/release/centos/8/x86_64/manticore-4.0.2_210921.af497f245-1.el8.x86_64.rpm, the corresponding package with symbols would be https://repo.manticoresearch.com/repository/manticoresearch/release/centos/8/x86_64/manticore-debuginfo-4.0.2_210921.af497f245-1.el8.x86_64.rpm
Note that both packages have the same commit id af497f245, which corresponds to the commit that this version was built from.
If you have installed Manticore from a Manticore APT/YUM repository, you can use one of the following tools to find a debug symbols package for you:
- debuginfo-install in CentOS 7
- dnf debuginfo-install in CentOS 8
- find-dbgsym-packages in Debian and Ubuntu
To find the build ID of your binary, run file /usr/bin/searchd:
[root@srv lib]# file /usr/bin/searchd
/usr/bin/searchd: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=2c582e9f564ea1fbeb0c68406c271ba27034a6d3, stripped
In this case, the build ID is 2c582e9f564ea1fbeb0c68406c271ba27034a6d3.
Once the debug symbols package is installed, you can find the symbols by the build ID in /usr/lib/debug/.build-id like this:
[root@srv ~]# ls -la /usr/lib/debug/.build-id/2c/582e9f564ea1fbeb0c68406c271ba27034a6d3*
lrwxrwxrwx. 1 root root 23 Nov 9 10:42 /usr/lib/debug/.build-id/2c/582e9f564ea1fbeb0c68406c271ba27034a6d3 -> bin/searchd
lrwxrwxrwx. 1 root root 27 Nov 9 10:42 /usr/lib/debug/.build-id/2c/582e9f564ea1fbeb0c68406c271ba27034a6d3.debug -> usr/bin/searchd.debug
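If you are unsure whether the installed symbols actually match your binary, you can compare Build IDs directly. This is a sketch using the build ID from the example above; readelf is part of binutils:
# Compare the Build ID of the binary with the installed debug symbols
readelf -n /usr/bin/searchd | grep 'Build ID'
ls -la /usr/lib/debug/.build-id/2c/582e9f564ea1fbeb0c68406c271ba27034a6d3.debug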
To fix your bug, developers often need to reproduce it locally. To do this, they need your configuration file, table files, binlog (if present), and sometimes source data (such as data from external storages or XML/CSV files) and queries.
Attach your data when you create a ticket on Github. If the data is too large or sensitive, you can upload it to our write-only S3 storage at s3://s3.manticoresearch.com/write-only/. Here's how you can do it using the Minio client:
1. Install the client https://min.io/docs/minio/linux/reference/minio-mc.html#install-mc
For example on 64-bit Linux:
curl https://dl.min.io/client/mc/release/linux-amd64/mc \
--create-dirs \
-o $HOME/minio-binaries/mc
chmod +x $HOME/minio-binaries/mc
export PATH=$PATH:$HOME/minio-binaries/
2. Add our host: cd $HOME/minio-binaries and then ./mc config host add manticore http://s3.manticoresearch.com:9000 manticore manticore
3. Copy your files: cd $HOME/minio-binaries and then ./mc cp -r issue-1234/ manticore/write-only/issue-1234. Make sure the folder name is unique and best if it corresponds to the issue on GitHub where you described the bug.
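Putting it all together, a typical upload session might look like the following. This is only a sketch: the issue number, table name, and file paths are placeholders you should replace with your own:
# Package the config and table files, then upload them to the write-only bucket
mkdir $HOME/issue-1234
cp /etc/manticoresearch/manticore.conf $HOME/issue-1234/
cp -r /var/lib/manticore/mytable $HOME/issue-1234/    # 'mytable' is a placeholder table name
cd $HOME/minio-binaries
./mc config host add manticore http://s3.manticoresearch.com:9000 manticore manticore
./mc cp -r $HOME/issue-1234/ manticore/write-only/issue-1234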
DEBUG [ subcommand ]
The DEBUG statement is designed for developers and testers to call various internal or VIP commands. However, it is not intended for production use as the syntax of the subcommand component may change freely in any build.
To view a list of useful commands and DEBUG statement subcommands available in the current context, simply call DEBUG without any parameters.
mysql> debug;
+-------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
| command | meaning |
+-------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
| flush logs | emulate USR1 signal |
| reload indexes | emulate HUP signal |
| debug token <password> | calculate token for password |
| debug malloc_stats | perform 'malloc_stats', result in searchd.log |
| debug malloc_trim | pefrorm 'malloc_trim' call |
| debug sleep <N> | sleep for <N> seconds |
| debug tasks | display global tasks stat (use select from @@system.tasks instead) |
| debug sched | display task manager schedule (use select from @@system.sched instead) |
| debug merge <TBL> [chunk] <X> [into] [chunk] <Y> [option sync=1,byid=0] | For RT table <TBL> merge disk chunk X into disk chunk Y |
| debug drop [chunk] <X> [from] <TBL> [option sync=1] | For RT table <TBL> drop disk chunk X |
| debug files <TBL> [option format=all|external] | list files belonging to <TBL>. 'all' - including external (wordforms, stopwords, etc.) |
| debug close | ask server to close connection from it's side |
| debug compress <TBL> [chunk] <X> [option sync=1] | Compress disk chunk X of RT table <TBL> (wipe out deleted documents) |
| debug split <TBL> [chunk] <X> on @<uservar> [option sync=1] | Split disk chunk X of RT table <TBL> using set of DocIDs from @uservar |
| debug wait <cluster> [like 'xx'] [option timeout=3] | wait <cluster> ready, but no more than 3 secs. |
| debug wait <cluster> status <N> [like 'xx'] [option timeout=13] | wait <cluster> commit achieve <N>, but no more than 13 secs |
| debug meta | Show max_matches/pseudo_shards. Needs set profiling=1 |
| debug trace OFF|'path/to/file' [<N>] | trace flow to file until N bytes written, or 'trace OFF' |
| debug curl <URL> | request given url via libcurl |
+-------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
19 rows in set (0.00 sec)
Same from VIP connection:
mysql> debug;
+-------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
| command | meaning |
+-------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
| flush logs | emulate USR1 signal |
| reload indexes | emulate HUP signal |
| debug shutdown <password> | emulate TERM signal |
| debug crash <password> | crash daemon (make SIGSEGV action) |
| debug token <password> | calculate token for password |
| debug malloc_stats | perform 'malloc_stats', result in searchd.log |
| debug malloc_trim | pefrorm 'malloc_trim' call |
| debug procdump | ask watchdog to dump us |
| debug setgdb on|off | enable or disable potentially dangerous crash dumping with gdb |
| debug setgdb status | show current mode of gdb dumping |
| debug sleep <N> | sleep for <N> seconds |
| debug tasks | display global tasks stat (use select from @@system.tasks instead) |
| debug sched | display task manager schedule (use select from @@system.sched instead) |
| debug merge <TBL> [chunk] <X> [into] [chunk] <Y> [option sync=1,byid=0] | For RT table <TBL> merge disk chunk X into disk chunk Y |
| debug drop [chunk] <X> [from] <TBL> [option sync=1] | For RT table <TBL> drop disk chunk X |
| debug files <TBL> [option format=all|external] | list files belonging to <TBL>. 'all' - including external (wordforms, stopwords, etc.) |
| debug close | ask server to close connection from it's side |
| debug compress <TBL> [chunk] <X> [option sync=1] | Compress disk chunk X of RT table <TBL> (wipe out deleted documents) |
| debug split <TBL> [chunk] <X> on @<uservar> [option sync=1] | Split disk chunk X of RT table <TBL> using set of DocIDs from @uservar |
| debug wait <cluster> [like 'xx'] [option timeout=3] | wait <cluster> ready, but no more than 3 secs. |
| debug wait <cluster> status <N> [like 'xx'] [option timeout=13] | wait <cluster> commit achieve <N>, but no more than 13 secs |
| debug meta | Show max_matches/pseudo_shards. Needs set profiling=1 |
| debug trace OFF|'path/to/file' [<N>] | trace flow to file until N bytes written, or 'trace OFF' |
| debug curl <URL> | request given url via libcurl |
+-------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
24 rows in set (0.00 sec)
All debug XXX commands should be regarded as non-stable and subject to modification at any time, so don't be surprised if they change. This example output may not reflect the actual available commands, so try it on your system to see what is available on your instance. Additionally, there is no detailed documentation provided aside from this short 'meaning' column.
As a quick illustration, two commands available only to VIP clients are described below - shutdown and crash. Both require a token, which can be generated with the debug token subcommand, and added to the shutdown_token param in the searchd section of the config file. If no such section exists, or if the provided password hash does not match the token stored in the config, the subcommands will do nothing.
mysql> debug token hello;
+-------------+------------------------------------------+
| command | result |
+-------------+------------------------------------------+
| debug token | aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d |
+-------------+------------------------------------------+
1 row in set (0,00 sec)
The subcommand shutdown will send a TERM signal to the server, causing it to shut down. This can be dangerous, as nobody wants to accidentally stop a production service. Therefore, it requires a VIP connection and the password to be used.
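For illustration, here is a hedged sketch of how the pieces fit together, reusing the password hello and the token generated above (your searchd section will contain other settings as well):
searchd {
    # ... your other searchd settings ...
    shutdown_token = aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d
}
After restarting the server, running debug shutdown hello from a VIP SQL connection would then send the TERM signal and stop searchd.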
The subcommand crash literally causes a crash. It may be used for testing purposes, such as to test how the system manager maintains the service's liveness or to test the feasibility of tracking coredumps.
If some commands are found to be useful in a more general context, they may be moved from the debug subcommands to a more stable and generic location (as exemplified by the debug tasks and debug sched in the table).
To be put in the common {} section of the configuration file:
indexer is a tool to create plain tables.
To be put in the indexer {} section of the configuration file:
indexer [OPTIONS] [indexname1 [indexname2 [...]]]
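For example, a typical run that (re)builds all plain tables defined in the configuration file and tells a running searchd to pick them up might look like this (the config path is just an example):
indexer --config /etc/manticoresearch/manticore.conf --all --rotate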
index_converter is a tool designed to convert tables created with Sphinx/Manticore Search 2.x into the Manticore Search 3.x table format.
index_converter {--config /path/to/config|--path}
searchd is the Manticore server.
To be put in the searchd {} section of the configuration file:
--stopwait timeout
shutdown command from VIP SQL connection
load_files mode
searchd [OPTIONS]
Assorted table maintenance features helpful for troubleshooting.
indextool <command> [options]
Utilized for dumping various debug information related to the physical table.
indextool <command> [options]
--check
Splits compound words into their components.
wordbreaker [-dict path/to/dictionary_file] {split|test|bench}
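For instance, split mode reads text from stdin and uses a frequency dictionary to decide where to break compounds. A sketch, assuming a dictionary file named dict.txt:
echo -n "lordoftherings" | wordbreaker -dict dict.txt split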
Extracts the contents of a dictionary file that uses ispell or MySpell format.
spelldump [options] <dictionary> <affix> [result] [locale-name]
A comprehensive alphabetical list of keywords currently reserved in Manticore SQL syntax (thus, they cannot be used as identifiers).
AND, AS, BY, COLUMNARSCAN, DATE_ADD, DATE_SUB, DAY, DISTINCT, DIV, DOCIDINDEX, EXPLAIN, FACET, FALSE, FORCE, FROM, HOUR, IGNORE, IN, INTERVAL, INDEXES, INNER, IS, JOIN, KNN, LEFT, LIMIT, MINUTE, MOD, MONTH, NOT, NO_COLUMNARSCAN, NO_DOCIDINDEX, NO_SECONDARYINDEX, NULL, OFFSET, ON, OR, ORDER, QUARTER, REGEX, RELOAD, SECOND, SECONDARYINDEX, SELECT, SYSFILTERS, TRUE, WEEK, YEAR